The Foundations 
of Statistics 


LEONARD J. SAVAGE 


Late Eugene Higgins Professor of Statistics 
Yale University 


SECOND REVISED EDITION 


DOVER PUBLICATIONS, INC. 
NEW YORK 


Copyright © 1972 by Dover Publications, Inc. 

Copyright © 1954 by I. Richard Savage. 

All rights reserved under Pan American and Inter- 
national Copyright Conventions. 


This Dover edition, first published in 1972, is a 
revised and enlarged version of the work originally 
published by John Wiley & Sons in 1954. 


International Standard Book Number: 0-486-62349-1 
Library of Congress Catalog Card Number: 79-188245 


Manufactured in the United States of America 
Dover Publications, Inc. 
180 Varick Street 
New York, N.Y. 10014 


TO MY FATHER 


Preface to the Dover Edition 


CONTINUING INTEREST HAS ENCOURAGED PUBLICATION OF A SECOND 
edition of this book. Because revising it to fit my present thinking and 
the new climate of opinion about the foundations of statistics would 
obliterate rather than restore, I have limited myself in the preparation 
of this edition much as though dealing with the work of another. 

The objective errors that have come to my attention, mainly through 
the generosity of readers, of whom Peter Fishburn has my special 
thanks, have been corrected, of course. Minor and mechanical ones, such 
as a name misspelled or an inequality that had persisted in pointing in 
the wrong direction, have been silently eliminated. Other changes are 
conspicuous as additions. They consist mainly of this Preface, Appendix 
4: Bibliographic Supplement, and several footnotes identified as new 
by the sign +. To enable you to pursue the many new developments 
since 1954 according to the intensity and direction of your own 
interests, a number of new references leading to many more are listed in 
the Bibliographic Supplement, and the principal advances known to me 
are pointed out in new footnotes or in comments on the new references. 

Citations to the bibliography in the original Appendix 3 are made 
by a compact, but otherwise ill-advised, letter and number code; those 
to the new Appendix 4 are made by a now popular system, which is 
effective, informative, and flexible. Example: The historic papers (Borel 
1924) and [D2] have been translated by Kyburg and Smokler (1964). 

The following paragraphs are intended to help you approach 
this book with a more current perspective. To some extent, they will be 
intelligible and useful even to a novice in the foundations of statistics, 
but they are necessarily somewhat technical and will therefore take on 
new meaning if you return to them as your reading in this book and 
elsewhere progresses. 

The book falls into two parts. The first, ending with Chapter 7, is a 
general introduction to the personalistic tradition in probability and 
utility. Were this part to be done over, radical revision would not be 
required, though I would now supplement the line of argument center- 
ing around a system of postulates by other less formal approaches, each 
convincing in its own way, that converge to the general conclusion that 
personal (or subjective) probability is a good key, and the best yet 
known, to all our valid ideas about the applications of probability. There 
would also be many new works to report on and analyze more thoroughly 
than can be done in footnotes. 

The original aim of the second part of the book, beginning with 
Chapter 8, is all too plainly stated in the second complete paragraph on 
page 4. There, a personalistic justification is promised for the popular 
body of devices developed by the enthusiastically frequentistic schools 
that then occupied almost the whole statistical scene and still dominate 
it, though less completely. The second part of the book is indeed devoted 
to personalistic discussion of frequentistic devices, but for one after 
another it reluctantly admits that justification has not been found. 
Freud alone could explain how the rash and unfulfilled promise on 
page 4 went unamended through so many revisions of the manuscript. 

Today, as I see it, the theory of personal probability applied to sta- 
tistics shows that many of the prominent frequentistic devices can at 
best lead to accidental and approximate, not systematic and cogent, 
success, as is expanded upon, perhaps more optimistically, by Pratt (1965). 
Among the ill-founded frequentistic devices are minimax rules, almost 
all tail-area tests, tolerance intervals, and, in a sort of class by itself, 
fiducial probability. 

If I have lost faith in the devices of the frequentistic schools, I have 
learned new respect for some of their general theoretical ideas. Let me 
amplify first in connection with the Neyman-Pearson school. While 
insisting on long-run frequency as the basis of probability, that school 
wisely emphasizes the ultimate subjectivity of statistical inference or 
behavior within the objective constraint of “admissibility,” as in 
(Lehmann 1958; Wolfowitz 1962). But careful study of admissibility leads 
almost inexorably to the recognition of personal probabilities and their 
central role in statistics (Savage 1961, Section 4; 1962, pp. 170-175). 
So personalistic statistics appears as a natural late development of the 
Neyman-Pearson ideas. 

One consequence of this sort of analysis of admissibility is the ex- 
tremely important likelihood principle, a corollary of Bayes’ theorem, 
of which I was not even aware when writing the first edition of this book. 
This principle, inferable from, though nominally at variance with, 
Neyman-Pearson ideas (Birnbaum 1962), was first put forward by 
Barnard (1947) and by Fisher (1955), members of what might be 
called the Fisher school of frequentists. See also (Barnard 1965; Bar- 
nard et al. 1962; Cornfield 1966). 

The views just expressed are evidently controversial, and if I have 
permitted myself such expressions as “show” and “inexorably,” they 
are not meant with mathematical finality. Yet, controversial though 
they may be, they are today shared by a number of statisticians, who 
may be called personalistic Bayesians, or simply personalists. This book 
has played—and continues to play—a role in the personalistic move- 
ment, but the movement itself has other sources apart from those from 
which this book itself was drawn. One with great impact on practical 
statistics and scientific management is a book by Robert Schlaifer 
(1959). This is a welcome opportunity to say that his ideas were devel- 
oped wholly independently of the present book, and indeed of other 
personalistic literature. They are in full harmony with the ideas in 
this book but are more down to earth and less spellbound by tradition. 


L. J. SAVAGE 
Yale University 
June, 1971 


Preface to the First Edition 


A BOOK ABOUT SO CONTROVERSIAL A SUBJECT AS THE FOUNDATIONS 
of statistics may have some value in the classroom, as I hope this one 
will; but it cannot be a textbook, or manual of instruction, stating the 
accepted facts about its subject, for there scarcely are any. Openly, or 
coyly screened behind the polite conventions of what we call a disinter- 
ested approach, it must, even more than other books, be an airing of 
its author’s current opinions. 

One who so airs his opinions has serious misgivings that (as may be 
judged from other prefaces) he often tries to communicate along with 
his book. First, he longs to know, for reasons that are not altogether 
noble, whether he is really making a valuable contribution. His own 
conceit, the encouragement of friends, and the confidence of his pub- 
lisher have given him hope, but he knows that the hopes of others in 
his position have seldom been fully realized. 

Again, what he has written is far from perfect, even to his biased 
eye. He has stopped revising and called the book finished, because 
one must sooner or later. 

Finally, he fears that he himself, and still more such public as he 
has, will forget that the book is tentative, that an author’s most recent 
word need not be his last word. 

The application of statistics interests some workers in almost every 
field of empirical investigation—not only in science, but also in com- 
merce and industry. Moreover, the foundations of statistics are con- 
nected conceptually with many disciplines outside of statistics itself, 
particularly mathematics, philosophy, economics, and psychology—a 
situation that, incidentally, must augment the natural misgivings of 
an author in this field about his own competence. Those who read in 
this book may, therefore, be diverse in background and interests. With 
this consideration in mind, I have endeavored to keep the book as free 
from technical prerequisites as its subject matter and its restriction to 
a reasonable size permit. 

Technical knowledge of statistics is nowhere assumed, but the reader 
who has some general knowledge of statistics will be much better 
prepared to understand and appraise this book. The books Statistics, by 
L. H. C. Tippett, and On the Principles of Statistical Inference by 
A. Wald, listed in the Bibliography at the end of Appendix 3, are short 
authoritative introductions to statistics, either of which would provide 
some statistical background for this book. The books of Tippett and 
Wald are so different in tone and emphasis that it would by no means 
be wasteful to read them both, in that order. 

Any but the most casual reader should have some formal preparation 
in the theory of mathematical probability. Those acquainted with 
moderately advanced theoretical statistics will automatically have this 
preparation; others may acquire it, for example, by reading Theory of 
Probability, by M. E. Munroe, or selected parts of An Introduction to 
Probability Theory and Its Applications, by W. Feller, according to 
their taste. In Feller’s book, a thorough reading of the Introduction 
and Chapter 1, and a casual reading of Chapters 5, 7, and 8 would be 
sufficient. 

The explicit mathematical prerequisites are not great; a year of cal- 
culus would in principle be more than enough. But, in practice, read- 
ers without some training in formal logic or one of the abstract branches 
of mathematics usually taught only after calculus will, I fear, find some 
of the long though elementary mathematical deductions quite forbid- 
ding. For the sake of such readers, I therefore take the liberty of giv- 
ing some pedagogical advice here and elsewhere that mathematically 
more mature readers will find superfluous and possibly irritating. In 
the first place, it cannot be too strongly emphasized that a long mathe- 
matical argument can be fully understood on first reading only when it 
is very elementary indeed, relative to the reader’s mathematical knowl- 
edge. If one wants only the gist of it, he may read such material once 
only; but otherwise he must expect to read it at least once again. Seri- 
ous reading of mathematics is best done sitting bolt upright on a hard 
chair at a desk. Pencil and paper are nearly indispensable; for there 
are always figures to be sketched and steps in the argument to be veri- 
fied by calculation. In this book, as in many mathematical books, 
when exercises are indicated, it is absolutely essential that they be 
read and nearly essential that they be worked, because they constitute 
part of the exposition, the exercise form being adopted where it seems 
to the author best for conveying the particular information at hand. 

To some mathematicians, and even more to logicians, I must say a 
word of apology for what they may consider lapses of rigor, such as 
using the same symbol with more than one meaning and failing to dis- 
tinguish uniformly between the use and the mention of a symbol; but 
they will understand that these lapses are sacrifices to what I take to 
be general intelligibility and will have, I hope, no real difficulty in re- 
pairing them. 

Few will wish to read the whole book; therefore introductions to the 
chapters and sections have been so written as not only to provide orien- 
tation but also to facilitate skipping. In particular, safe detours are 
indicated around mathematically advanced topics and other digressions. 

A few words in explanation of the conventions, such as those by which 
internal and external references are made in this book, may be useful. 

The abbreviation § 3.4 means Section 4 of Chapter 3; within Chapter 
3 itself, this would be abbreviated still further to § 4. The abbreviation 
(3.4.1) means the first numbered and displayed equation or other ex- 
pression in § 3.4; within Chapter 3, this would be abbreviated still 
further to (4.1) and within § 3.4 simply to (1). Theorems, lemmas, 
exercises, corollaries, figures, and tables are named by a similar system, 
e.g., Theorem 3.4.1, Theorem 4.1, Theorem 1. Incidentally, the proofs 
of theorems are terminated with the special punctuation mark ◆, a 
device borrowed from Halmos’s Measure Theory. 
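
The abbreviation rule just described can be sketched in a few lines of Python. This is only an illustration of the convention, not part of the book's apparatus; the function names `abbreviate` and `cite` are hypothetical, and a reference is assumed to be given as a tuple (chapter, section) or (chapter, section, item).

```python
def abbreviate(ref, here):
    """Shorten a reference relative to the place where it is cited.

    ref  -- (chapter, section) or (chapter, section, item)
    here -- (chapter, section) in which the citation appears
    """
    if ref[0] != here[0]:
        parts = ref          # another chapter: keep every number
    elif len(ref) == 2 or ref[1] != here[1]:
        parts = ref[1:]      # same chapter: drop the chapter number
    else:
        parts = ref[2:]      # same section: keep only the item number
    return ".".join(str(p) for p in parts)


def cite(ref, here, kind="equation"):
    """Format a shortened reference as a section, an equation, or a named item."""
    short = abbreviate(ref, here)
    if kind == "section":
        return "§ " + short
    if kind == "equation":
        return "(" + short + ")"
    return kind + " " + short  # e.g. kind="Theorem"


print(cite((3, 4), (5, 2), "section"))     # cited from Chapter 5
print(cite((3, 4, 1), (3, 2)))             # cited from elsewhere in Chapter 3
print(cite((3, 4, 1), (3, 4), "Theorem"))  # cited from within § 3.4
```

Read from Chapter 5, Section 4 of Chapter 3 keeps its full label; from elsewhere in Chapter 3 the chapter number is dropped; and from within § 3.4 only the item number remains.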

Seven postulates, P1, P2, etc., are introduced over the course of 
several chapters. For ready reference these are, with some explanatory 
material, reproduced on the end papers. 

Entries in the Bibliography at the end of Appendix 3 are designated 
by a self-explanatory notation in square brackets. For example, the 
works of Tippett, Wald, Munroe, Feller, and Halmos, already referred 
to, are [T2], [W1], [M6], [F1], and [H2], respectively. 

I often allude to a set of key references to a given topic. This means 
a set of external references intended to lead the reader that wishes to 
pursue that particular topic to the fullest and most recent bibliographies; 
it has nothing to do with the merit or importance of the works referred to. 

Technical terms (except for non-verbal symbols) that are defined in 
this book are printed in bold face or italics (depending on the impor- 
tance of the term for this book or for established usage) in the context 
where the term is defined. These special fonts are occasionally used 
for other purposes as well. Terms are sometimes used informally— 
even in unofficial definitions—before being officially defined. Even the 
official definitions are sometimes of necessity very loose, corresponding 
to the well-known principle that, in a formal theory, some terms must 
in strict logic be left undefined. 

L. J. SAVAGE 

University of Chicago 

April, 1954 


Acknowledgement 


I HAVE MANY FRIENDS, FEW OF WHOM SHARE MY PRESENT OPIN- 
ions, to thank for criticism and encouragement. Though the list seems 
long, I cannot refrain from explicitly mentioning: I. Bross, A. Burks, 
R. Carnap, B. de Finetti, M. Flood, I. J. Good, P. R. Halmos, O. Hel- 
mer, C. Hildreth, T. Koopmans, W. Kruskal, C. F. Mosteller, I. R. 
Savage, W. A. Wallis, and M. A. Woodbury. Wallis as chairman of 
my department and close friend has particularly encouraged me to 
write the book and facilitated my doing so in many ways. Mrs. Janet 
Lowrey and Miss Louise Forsyth typed and retyped and did so many 
other painstaking tasks so well that it would be inadequate to call 
their help secretarial. 

My work on the book was made possible by four organizations to 
which I herewith express thanks. During the years 1950 through 1954 
I worked on it at the University of Chicago, where the work was sup- 
ported by the Office of Naval Research and the University itself, which 
also supported it during the summer of 1952. During the academic 
year 1951-52 I worked on it as a research scholar in France under the 
Fulbright Act (Public Law 584, 79th Congress), and during the whole 
of that year as a fellow of the John Simon Guggenheim Memorial 
Foundation. 

L. J. S. 


Contents 


Postulates of a personalistic theory of decision (end papers) 


1. INTRODUCTION 

   1. The role of foundations 
   2. Historical background 
   3. General outline of this book 


2. PRELIMINARY CONSIDERATIONS ON DECISION IN THE FACE OF UNCERTAINTY 

   1. Introduction 
   2. The person 
   3. The world, and states of the world 
   4. Events 
   5. Consequences, acts, and decisions 
   6. The simple ordering of acts with respect to preference 
   7. The sure-thing principle 


3. PERSONAL PROBABILITY 

   1. Introduction 
   2. Qualitative personal probability 
   3. Quantitative personal probability 
   4. Some mathematical details 
   5. Conditional probability, qualitative and quantitative 
   6. The approach to certainty through experience 
   7. Symmetric sequences of events 


4. CRITICAL COMMENTS ON PERSONAL PROBABILITY 

   1. Introduction 
   2. Some shortcomings of the personalistic view 
   3. Connection with other views 
   4. Criticism of other views 
   5. The role of symmetry in probability 
   6. How can science use a personalistic view of probability? 


5. UTILITY 

   1. Introduction 
   2. Gambles 
   3. Utility, and preference among gambles 
   4. The extension of utility to more general acts 
   5. Small worlds 
   6. Historical and critical comments on utility 


6. OBSERVATION 

   1. Introduction 
   2. What an observation is 
   3. Multiple observations, and extensions of observations and of sets of acts 
   4. Dominance and admissibility 
   5. Outline of the design of experiments 


7. PARTITION PROBLEMS 

   1. Introduction 
   2. Structure of (twofold) partition problems 
   3. The value of observation 
   4. Extension of observations, and sufficient statistics 
   5. Likelihood ratios 
   6. Repeated observations 
   7. Sequential probability ratio procedures 
   8. Standard form, and absolute comparison between observations 


8. STATISTICS PROPER 

   1. Introduction 
   2. What is statistics proper? 
   3. Multipersonal problems 
   4. The minimax theory 


9. INTRODUCTION TO THE MINIMAX THEORY 

   1. Introduction 
   2. The behavioralistic outlook 
   3. Mixed acts 
   4. Income and loss 
   5. The minimax rule, and the principle of admissibility 
   6. Illustrations of the minimax rule 
   7. Objectivistic motivation of the minimax rule 
   8. Loss as opposed to negative income in the minimax rule 


10. A PERSONALISTIC REINTERPRETATION OF THE MINIMAX THEORY 

   1. Introduction 
   2. A model of group decision 
   3. The group minimax rule, and the group principle of admissibility 
   4. Critique of the group minimax rule 


11. THE PARALLELISM BETWEEN THE MINIMAX THEORY AND THE THEORY OF 
    TWO-PERSON GAMES 

   1. Introduction 
   2. Standard games 
   3. Minimax play 
   4. Parallelism and contrast with the minimax theories 


12. THE MATHEMATICS OF MINIMAX PROBLEMS 

   1. Introduction 
   2. Abstract games 
   3. Bilinear games 
   4. An example of a bilinear game 
   5. Bilinear games exhibiting symmetry 


13. OBJECTIONS TO THE MINIMAX RULES 

   1. Introduction 
   2. A confusion between loss and negative income 
   3. Utility and the minimax rule 
   4. Almost sub-minimax acts 


14. THE MINIMAX THEORY APPLIED TO OBSERVATIONS 

   1. Introduction 
   2. Recapitulation of partition problems 
   3. Sufficient statistics 
   4. Simple dichotomy, an example 
   5. The approach to certainty 
   6. Cost of observation 
   7. Sequential probability ratio procedures 
   8. Randomization 


15. POINT ESTIMATION 

   1. Introduction 
   2. The verbalistic concept of point estimation 
   3. Examples of problems of point estimation 
   4. Criteria that have been proposed for point estimates 
   5. A behavioralistic review of the criteria for point estimation 
   6. A behavioralistic review, continued 
   7. A behavioralistic review, concluded 


16. TESTING 

   1. Introduction 
   2. A theory of testing 
   3. Testing in practice 


17. INTERVAL ESTIMATION AND RELATED TOPICS 

   1. Estimates of the accuracy of estimates 
   2. Interval estimation and confidence intervals 
   3. Tolerance intervals 
   4. Fiducial probability 


APPENDIX 4. BIBLIOGRAPHIC SUPPLEMENT 
TECHNICAL SYMBOLS 
AUTHOR INDEX 
GENERAL INDEX 


CHAPTER 1 


Introduction 


1 The role of foundations 


It is often argued academically that no science can be more secure 
than its foundations, and that, if there is controversy about the foun- 
dations, there must be even greater controversy about the higher parts 
of the science. As a matter of fact, the foundations are the most con- 
troversial parts of many, if not all, sciences. Physics and pure mathe- 
matics are excellent examples of this phenomenon. As for statistics, 
the foundations include, on any interpretation of which I have ever 
heard, the foundations of probability, as controversial a subject as one 
could name. As in other sciences, controversies over the foundations 
of statistics reflect themselves to some extent in everyday practice, but 
not nearly so catastrophically as one might imagine. I believe that 
here, as elsewhere, catastrophe is avoided, primarily because in prac- 
tical situations common sense generally saves all but the most pedantic 
of us from flagrant error. It is hard to judge, however, to what extent 
the relative calm of modern statistics is due to its domination by a 
vigorous school relatively well agreed within itself about the foundations. 

Although study of the foundations of a science does not have the 
role that would be assigned to it by naive first-things-firstism, it has a 
certain continuing importance as the science develops, influencing, and 
being influenced by, the more immediately practical parts of the science. 


2 Historical background 


The concept and problem of inductive inference have been promi- 
nent in philosophy at least since Aristotle. Mathematical work on some 
aspects of the problem of inference dates back at least to the early 
eighteenth century. Leibniz is said to be the first to publish a 
suggestion in that direction, but Jacob Bernoulli’s posthumous Ars 
Conjectandi (1713) [B12] seems to be the first concerted effort.† This 
mathematical work has always revolved around the concept of probability; 
but, though there was active interest in probability for nearly a 
century before the publication of Ars Conjectandi, earlier activity seems 
not to have been concerned with inductive inference. 

† Valuable information on this and other topics of the early philosophic history of 
probability is attractively presented in Keynes’ treatise [K4], especially in Chapters 
VII, XXIII, and the bibliography. 

In the present century there has been and continues to be extra- 
ordinary interest in mathematical treatment of problems of inductive 
inference. For reasons I cannot and need not analyze here, this ac- 
tivity has been strikingly concentrated in the English-speaking world. 
It is known under several names, most of which stress some aspect of 
the subject that seemed of overwhelming importance at the moment 
when the name was coined. “Mathematical statistics,” one of its 
earliest names, is still the most popular. In this name, “mathematical” 
seems to be intended to connote rational, theoretical, or perhaps mathe- 
matically advanced, to distinguish the subject from those problems of 
gathering and condensing numerical data that can be considered apart 
from the problem of inductive inference, the mathematical treatment 
of which is generally relatively trivial. The name “statistical inference” 
recognizes that the subject is concerned with inductive inference. The 
name “statistical decision” reflects the idea that inductive inference is 
not always, if ever, concerned with what to believe in the face of in- 
conclusive evidence, but that at least sometimes it is concerned with 
what action to decide upon under such circumstances. Within this 
book, there will be no harm in adopting the shortest possible name, 
“statistics.” 

It is unanimously agreed that statistics depends somehow on proba- 
bility. But, as to what probability is and how it is connected with 
statistics, there has seldom been such complete disagreement and break- 
down of communication since the Tower of Babel. There must be 
dozens of different interpretations of probability defended by living 
authorities, and some authorities hold that several different 
interpretations may be useful, that is, that the concept of probability may have 
different meaningful senses in different contexts. Doubtless, much of 
the disagreement is merely terminological and would disappear under 
sufficiently sharp analysis. Some believe that it would all disappear, 
or even that they have themselves already made the necessary 
analysis. 

Considering the confusion about the foundations of statistics, it is 
surprising, and certainly gratifying, to find that almost everyone is 
agreed on what the purely mathematical properties of probability are. 
Virtually all controversy therefore centers on questions of interpreting 
the generally accepted axiomatic concept of probability, that is, of de- 
termining the extramathematical properties of probability. 

The widely accepted axiomatic concept referred to is commonly 
ascribed to Kolmogoroff [K7] and goes by his name. It should be 
mentioned that there is some dissension from it on the part of a small group 
led by von Mises [V2]. There are also a few minor technical variations 
on the Kolmogoroff system that are sometimes of interest; they will be 
discussed in § 3.4. 

I would distinguish three main classes of views on the interpretation 
of probability, for the purposes of this book, calling them objectivistic, 
personalistic, and necessary. Condensed descriptions of these three 
classes of views seem called for here. If some readers find these descrip- 
tions condensed to the point of unintelligibility, let them be assured 
that fuller ones will gradually be developed as the book proceeds. 

Objectivistic views hold that some repetitive events, such as tosses 
of a penny, prove to be in reasonably close agreement with the mathe- 
matical concept of independently repeated random events, all with the 
same probability. According to such views, evidence for the quality 
of agreement between the behavior of the repetitive event and the 
mathematical concept, and for the magnitude of the probability that 
applies (in case any does), is to be obtained by observation of some 
repetitions of the event, and from no other source whatsoever. 

Personalistic views hold that probability measures the confidence 
that a particular individual has in the truth of a particular proposition, 
for example, the proposition that it will rain tomorrow. These views 
postulate that the individual concerned is in some ways “reasonable,” 
but they do not deny the possibility that two reasonable individuals 
faced with the same evidence may have different degrees of confidence 
in the truth of the same proposition. 
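
A small numerical sketch may make the last point concrete. It assumes, purely for illustration, that each individual revises his confidence by Bayes' theorem; the function name `posterior`, the two prior probabilities, and the likelihoods are all hypothetical figures, not taken from the text.

```python
from fractions import Fraction


def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' theorem for one proposition: confidence after seeing the evidence."""
    weight_true = prior * likelihood_if_true
    weight_false = (1 - prior) * likelihood_if_false
    return weight_true / (weight_true + weight_false)


# Hypothetical evidence: three times as probable if the proposition
# (say, that it will rain tomorrow) is true as if it is false.
evidence = (Fraction(3, 4), Fraction(1, 4))

# Two individuals who differ only in their initial confidence.
first = posterior(Fraction(1, 2), *evidence)
second = posterior(Fraction(1, 10), *evidence)

print(first, second)  # 3/4 1/4
```

Both individuals use the same rule on the same evidence, and both become more confident than before, yet they end with different degrees of confidence because they began with different ones.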

Necessary views hold that probability measures the extent to which 
one set of propositions, out of logical necessity and apart from human 
opinion, confirms the truth of another. They are generally regarded 
by their holders as extensions of logic, which tells when one set of prop- 
ositions necessitates the truth of another. 

After what has been said about the intensity and complexity of the 
controversy over the probability concept, you must realize that the 
short taxonomy above is bound to infuriate any expert on the founda- 
tions of probability, but I trust it may do the less learned more good 
than harm. 

The great burst of statistical research in the English-speaking world 
in the present century has revolved around objectivistic views on the 
interpretation of probability. As will shortly be explained, any purely 
objectivistic view entails a severe difficulty for statistics. This diffi- 
culty is recognized by members of the British-American School, if I 
may use that name without its being taken too literally or at all 
nationalistically, and is regarded by them as a great, though not 
insurmountable, obstacle; indeed, some of them see it as the central problem 
of statistics. 

The difficulty in the objectivistic position is this. In any objecti- 
vistic view, probabilities can apply fruitfully only to repetitive events, 
that is, to certain processes; and (depending on the view in question) 
it is either meaningless to talk about the probability that a given propo- 
sition is true, or this probability can be only 1 or 0, according as the 
proposition is in fact true or false. Under neither interpretation can 
probability serve as a measure of the trust to be put in the proposition. 
Thus the existence of evidence for a proposition can never, on an ob- 
jectivistic view, be expressed by saying that the proposition is true with 
a certain probability. Again, if one must choose among several courses 
of action in the light of experimental evidence, it is not meaningful, in 
terms of objective probability, to compute which of these actions is 
most promising, that is, which has the highest expected income. Hold- 
ers of objectivistic views have, therefore, no recourse but to argue that 
it is not reasonable to assign probabilities to the truth of propositions 
or to calculate which of several actions is the most promising, and that 
the need expressed by the attempt to set up such concepts must be 
met in other ways, if at all. 

The British-American School has had great success in several re- 
spects. The number of its adherents has rapidly increased. It has con- 
tributed many procedures of strong intuitive appeal and (one feels) of 
lasting worth. These have found widespread application in many 
sciences, in industry, and in commerce. The success of the school may 
pragmatically be taken as evidence for the correctness of the general 
view on which it is based. Indeed, anyone who overthrows that view 
must either discredit the procedures to which it has led, or show, as 
I hope to show in this book, that they are on the whole consistent with 
the alternative proposed. 

Some, I among them, hold that the grounds for adopting an objec- 
tivistic view are not overwhelmingly strong; that there are serious log- 
ical objections to any such view; and, most important of all, that the 
difficulty a strictly objectivistic view meets in statistics reflects real 
inadequacy. 


3 General outline of this book 


This book presents a theory of the foundations of statistics which is 
based on a personalistic view of probability derived mainly from the 
work of Bruno de Finetti, as expressed for example in [D2]. The theory 
is presented in a tentative spirit, for I realize that the serious blemishes 
in it apparent to me are not the only ones that will be discovered by 
critical readers. A theory of the foundations of statistics that appears 
contrary to the teaching of the most productive statisticians will prop- 
erly be regarded with extraordinary caution. Other views on proba- 
bility will, of course, be discussed in this book, partly for their own in- 
terest and partly to explain the relationship between the personalistic 
view on which this book is based and other views. 

The book is organized into seventeen chapters, of which the present 
introduction is the first. Chapters 2-7 are, so to speak, concerned with 
the foundations at a relatively deep level. They develop, explain, and 
defend a certain abstract theory of the behavior of a highly idealized 
person faced with uncertainty. That theory is shown to have as im- 
plications a theory of personal probability, corresponding to the per- 
sonalistic view of probability basic to this book, and also a theory of 
utility due, in its modern form, to von Neumann and Morgenstern 
[V4]. 

There is a transition, occurring in Chapter 8 and maintained through- 
out the rest of the book, to a shallower level of the foundations of sta- 
tistics; I might say from pre-statistics to statistics proper. In those 
later chapters, it is recognized that the theory developed in the earlier 
ones is too highly idealized for immediate application. Some compro- 
mises have to be made, and the appropriate ones are sought in an anal- 
ysis of some of the inventions and ideas of the British-American School. 
It will, I hope, be demonstrated thereby that the superficially incom- 
patible systems of ideas associated on the one hand with a personalistic 
view of probability and on the other with the objectivistically inspired 
developments of the British-American School do in fact lend each other 
mutual support and clarification. 


CHAPTER 2 


Preliminary Considerations 
on Decision in 
the Face of Uncertainty 


1 Introduction 


Decisions made in the face of uncertainty pervade the life of every 
individual and organization. Even animals might be said continually 
to make such decisions, and the psychological mechanisms by which 
men decide may have much in common with those by which animals 
do so. But formal reasoning presumably plays no role in the decisions 
of animals, little in those of children, and less than might be wished in 
those of men. It may be said to be the purpose of this book, and in- 
deed of statistics generally, to discuss the implications of reasoning for 
the making of decisions. 

Reasoning is commonly associated with logic, but it is obvious, as 
many have pointed out, that the implications of what is ordinarily 
called logic are meager indeed when uncertainty is to be faced. It has 
therefore often been asked whether logic cannot be extended, by prin- 
ciples as acceptable as those of logic itself, to bear more fully on un- 
certainty. An attempt to extend logic in this way will be begun in 
this chapter, differing in two important respects from most, but not 
all, other attempts. 

First, since logic is concerned with implications among propositions, 
many have thought it natural to extend logic by setting up criteria for 
the extent to which one proposition tends to imply, or provide evidence 
for, another. It seems to me obvious, however, that what is ultimately 
wanted is criteria for deciding among possible courses of action; and, 
therefore, generalization of the relation of implication seems at best a 
roundabout method of attack. It must be admitted that logic itself 
does lead to some criteria for decision, because what is implied by a 
proposition known to be true is in turn true and sometimes relevant to 
making a decision. Should some notion of partial implication be demonstrably 
even better articulated with decision than is implication itself, 
that would be excellent; but how is such a notion to be sought except 
by explicitly studying decision? Ramsey’s discussion in [R1] of 
the point at issue here is especially forceful. 

Second, it is appealing to suppose that, if two individuals in the same 
situation, having the same tastes and supplied with the same informa- 
tion, act reasonably, they will act in the same way. Such agreement, 
belief in which amounts to a necessary (as opposed to a personalistic) 
view of probability, is certainly worth looking for. Personally, I be- 
lieve that it does not correspond even roughly with reality, but, hav- 
ing at the moment no strong argument behind my pessimism on this 
point, I do not insist on it. But I do insist that, until the contrary be 
demonstrated, we must be prepared to find reasoning inadequate to 
bring about complete agreement. In particular, the extensions of logic 
to be adduced in this book will not bring about complete agreement; 
and whether enough additional principles to do so, or indeed any addi- 
tional principles of much consequence, can be adduced, I do not know. 
It may be, and indeed I believe, that there is an element in decision 
apart from taste, about which, like taste itself, there is no disputing. 

The next four sections of this chapter build up a formal model, or 
scheme, of the situation in which a person is faced with uncertainty; 
the final two, in terms of this model, motivate and state some of the 
few principles that seem to me entitled to be taken as postulates for 
rational decision. 


2 The person 


I am about to build up a highly idealized theory of the behavior of a 
“rational” person with respect to decisions. In doing so I will, of course, 
have to ask you to agree with me that such and such maxims of behavior 
are “rational.” In so far as “rational” means logical, there is no live 
question; and, if I ask your leave there at all, it is only as a matter of 
form.† But our person is going to have to make up his mind in situations 
in which criteria beyond the ordinary ones of logic will be necessary. 
So, when certain maxims are presented for your consideration, 
you must ask yourself whether you try to behave in accordance with 
them, or, to put it differently, how you would react if you noticed your- 
self violating them. 

† The assumption that a person’s behavior is logical is, of course, far from vacuous. 
In particular, such a person cannot be uncertain about decidable mathematical prop- 
ositions. This suggests, at least to me, that the tempting program sketched by Polya 
[P6] of establishing a theory of the probability of mathematical conjectures cannot 
be fully successful in that it cannot lead to a truly formal theory, but de Finetti 
[D5] seems more optimistic about the program.+ 

+ Polya has greatly elaborated his program, but not in the direction of seeking 
a formal theory. A curious early work by Cérésole (1915) is somewhat 
pertinent, and Hacking (1967) argues for the possibility of including 
mathematical uncertainty in a formal theory. 

It is brought out in economic theory that organizations sometimes 
behave like individual people, so that a theory originally intended to 
apply to people may also apply to (or may even apply better to) such 
units as families, corporations, or nations. In view of this possibility, 
economic theorists are sometimes reluctant to use the word “person,” 
or even “individual,” for the behaving units to which they refer; but 
for our purpose “person” threatens no confusion, though the possi- 
bility of using it in an extended sense may well be borne in mind. 


3 The world, and states of the world 


A formal description, or model, of what the person is uncertain about 
will be needed. To motivate this formal description, let me begin in- 
formally by considering a list of examples. The person might be un- 
certain about: 

1. Whether a particular egg is rotten. 

2. Which, if any, in a particular dozen eggs are rotten. 

3. The temperature at noon in Chicago yesterday. 

4. What the temperature was and will be in the place now covered 
by Chicago each noon from January 1, 1 A.D., to January 1, 4000 A.D. 

5. The infinite sequence of heads and tails that will result from re- 
peated tosses of a particular (everlasting) coin. 

6. The complete decimal expansion of π. 

7. The exact and entire past, present, and future history of the uni- 
verse, understood in any sense, however wide. 

These examples have a few features in common, though, if there are 
more than a few, it is a discredit to my imagination. Thus, in each 
there is some object about which the person is uncertain, an egg, a 
dozen eggs, a temperature, a sequence of temperatures, etc. Each ob- 
ject admits a certain class of descriptions that might thinkably apply 
to it. To illustrate, the egg of Example 1 might be rotten or not; and 
the terms of the example are meant to exclude any other description 
from consideration, though, of course, a real egg has many other fea- 
tures. Again, since any subset of the dozen eggs (including the extreme 
cases of all and none at all) might be rotten, there are 2^12 descriptions 
associated with Example 2. For Example 3 and each subsequent one, 
there are an infinite number of descriptions, though the array of de- 
scriptions is more complicated in some than in others, reaching the ulti- 
mate of complexity in Example 7. Example 6 is a little anomalous 
in that anything the person does not know about the description of π 
he could know in principle by thinking sufficiently hard about it, that 
is, by logic alone. This point, banal to some readers, needs explanation 
for others. If, for example, π is understood to be the area of a circle of 
unit radius, it follows by logic alone that π is not greater than the area 
of a square circumscribing the unit circle, that is, π < 4. By an elaboration 
of this method π can be computed to any degree of accuracy, 
and by other purely logical methods many other facts about π can be 
established, such as the fact that π is not a rational number. 

In connection with the concepts suggested by the preceding para- 
graph, the following nomenclature is proposed as brief, suggestive, and 
in reasonable harmony with the usages of statistics and ordinary dis- 
course. 


Term                             Definition

the world                        the object about which the person is
                                 concerned
a state (of the world)           a description of the world, leaving no
                                 relevant aspect undescribed
the true state (of the world)    the state that does in fact obtain, i.e.,
                                 the true description of the world


In application of the theory, the question will arise as to which world 
to use in a given context. Thus, if the person is interested in the only 
brown egg in a dozen, should that egg or the whole dozen be taken as 
the world? It will be seen as the theory is developed that in principle 
no harm is done by taking the larger of two worlds as a model of the 
situation. One is therefore tempted to adopt, once and for all, one 
world sufficiently large, say Example 7. The most serious objection to 
this is that Example 7 is vague, and some mathematical and philosophi- 
cal experience suggests that the vagueness cannot be removed without 
ruining the universality of the example. It may also be added that the 
use of modest little worlds, tailored to particular contexts, is often a 
simplification, the advantage of which is justified by a considerable 
body of mathematical experience with related ideas. 

The sense in which the world of a dozen eggs is larger than the world 
of the one brown egg in the dozen is in some respects obvious. It may 
be well, however, to emphasize that a state of the smaller world corre- 
sponds not to one state of the larger, but to a set of states. Thus, 
“The brown egg is rotten” describes the smaller world completely, and 
therefore is a state of it; but the same statement leaves much about the 
larger world unsaid and corresponds to a set of 2^11 states of it. In the 
sense under discussion a smaller world is derived from a larger by neg- 
lecting some distinctions between states, not by ignoring some states 
outright. The latter sort of contraction may be useful in case certain 
states are regarded by the person as virtually impossible so that they 
can be ignored. 
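
The correspondence between a state of the smaller world and a set of states of the larger one can be checked by direct enumeration. The following Python sketch is only an illustration: the encoding of a state as a tuple of twelve booleans, the names `large_world` and `project`, and the choice of position 0 for the brown egg are assumptions made here, not part of the text.

```python
from itertools import product

N_EGGS = 12
BROWN = 0  # assumed position of the one brown egg (illustrative choice)

# A state of the larger world says, for each egg, whether it is rotten.
large_world = list(product([False, True], repeat=N_EGGS))

def project(state):
    """Map a state of the larger world to a state of the smaller world,
    neglecting every distinction except the condition of the brown egg."""
    return state[BROWN]

# The set of larger-world states corresponding to "The brown egg is rotten".
brown_rotten = [s for s in large_world if project(s)]

print(len(large_world))   # 4096, i.e. 2**12
print(len(brown_rotten))  # 2048, i.e. 2**11
```

The projection neglects distinctions between states; it does not discard any state of the larger world, which is the sense of "smaller" used in the paragraph above.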


4 Events 


An event is a set of states. For example, in connection with the 
world of Example 2, the person might well be concerned with the event 
that exactly one egg in the dozen is rotten (an event having 12 states 
as elements), or, a little less academically, that at least one of the eggs 
is rotten (an event having 2^12 - 1 states as elements, i.e., all the states 
in the world but one). In connection with the world of Example 3, 
the person might be concerned with the event, having an infinite num- 
ber of states, that the temperature at noon in Chicago yesterday was 
below freezing. To give a final illustration, of a more mathematical 
flavor, consider in connection with Example 5 the event that the ratio 
of the number of heads to tails approaches 3 as the sequence progresses 
to infinity. 

In connection with any given world, there are two events that are 
of the utmost logical importance, though in ordinary discourse it may 
seem banal even to mention their existence. These are the universal 
and the vacuous events. The universal event, here to be symbolized 
by S, is the event having every state of the world as element. In so 
far as “world” has a real technical meaning, S is the world. The vacuous 
event, which can here be safely enough symbolized by the 0 of 
arithmetic, is the event having no states as elements. To illustrate, in 
Example 1 the event that the egg is rotten or good is the universal 
event, and that it is both rotten and good is the vacuous event. 

It is important to be able to express the idea that a given event contains 
the true state among its elements. English usage seems to offer 
no alternative to the rather stuffy expression, “the event obtains.” 
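
The counts quoted for the events of Example 2, and the notion of an event obtaining, can be illustrated with ordinary Python sets. This is a sketch under assumed conventions (a state is a tuple of twelve booleans, `True` meaning rotten; the function name `obtains` is chosen here for illustration).

```python
from itertools import product

# The universal event S for Example 2: every state of the dozen eggs.
S = set(product([False, True], repeat=12))

# An event is a set of states.
exactly_one_rotten = {s for s in S if sum(s) == 1}
at_least_one_rotten = {s for s in S if any(s)}
universal = S
vacuous = set()

def obtains(event, true_state):
    """An event obtains when it contains the true state among its elements."""
    return true_state in event

print(len(exactly_one_rotten))   # 12
print(len(at_least_one_rotten))  # 4095, i.e. 2**12 - 1

# Whatever the true state, the universal event obtains and the vacuous one does not.
some_state = (True,) + (False,) * 11
print(obtains(universal, some_state), obtains(vacuous, some_state))  # True False
```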

The theory under development makes no formal reference to time. 
In particular, the concept of event as here formulated is timeless, though 
temporal ideas may be employed in the description of particular events. 
Thus, it would not be said that Lincoln’s assassination is an event that 
occurred in 1865 and that the next return of Halley’s comet is one that 
will occur in 1985, but that Lincoln’s assassination in 1865 and the 
return of Halley’s comet in, but not before, 1985 are events that 
obtain. 

Modern mathematical usage, especially that of a branch of mathe- 
matics called Boolean algebra, suggests the following table of defini- 
tions in connection with the concepts of state and event. Some of 
these are synonyms, others abbreviations, and still others new terms 
compounded out of old. 

Though the notations introduced in Table 1 are very elementary 
and of great utility, they are not ordinarily taught except in connec- 
tion with logic or relatively advanced mathematics. A set of exercises 
illustrating their use is therefore given below in the form of a numbered 
list of statements. These statements are true whatever the sets A, B, 


TABLE 1. MATHEMATICAL NOMENCLATURE PERTAINING TO STATES AND EVENTS

Term                             Definition

(Basic terms)
set                              event
A, B, C, ···                     generic symbols for events
s, s′, s″, ···                   generic symbols for states
S                                the universal event
0                                the vacuous event

(Relations)
s ε A                            s is an element of A, i.e., a state in A.†
A ⊂ B (or B ⊃ A)                 A is contained in B, i.e., every element
                                 of A is an element of B.
A = B                            A equals B, i.e., A is the same set as B,
                                 i.e., A and B have exactly the same
                                 elements.

(Constructs)
the complement of A with         those elements of S that are not in A
  respect to S
~A                               the complement of A with respect to S
the union of the A_i's           those elements of S that are elements
                                 of at least one of the sets A_1, A_2, etc.
∪_i A_i                          the union of the A_i's
A ∪ B                            the union of A and B, i.e., those elements
                                 of S that are elements of A or
                                 B (possibly of both)
the intersection of the A_i's    those elements of S that are elements
                                 of each of the sets A_1, A_2, etc.
∩_i A_i                          the intersection of the A_i's
A ∩ B                            the intersection of A and B, i.e., those
                                 elements of S that are elements of
                                 both A and B


† Typographical note: The Porson font of the Greek alphabet (α, β, γ, δ, ϵ, ζ, ···) 
is the one almost always printed, at least in America, when mathematical constants 
and variables are denoted by Greek letters. The symbol ε used in this and some other 
publications to denote ‘element of’ is, however, the epsilon of the Vertical font 
(α, β, γ, δ, ε, ζ, ···). Some publications use the special symbol ∈; and some use ϵ, 
the Porson epsilon, presumably because of its resemblance to ∈. The latter usage 
entails either using ϵ for two different purposes or else changing fonts in mid alphabet 
(α, β, γ, δ, ε, ζ, ···) when constants and variables are denoted by Greek letters. 
C may be. Mathematicians would for the most part verify them by 
translating them into English and appealing to common sense, though 
in complicated cases explicit use might be made of Exercise 9. Dia- 
grams, called Venn diagrams, in which sets are symbolized by areas, 
as illustrated by Figure 1, are often suggestive. 


[Figure 1: Venn diagram with regions labeled A and ~(A ∪ B)]


It is a remarkable and useful fact that any universally valid statement 
about sets remains so if, throughout, ∪ is interchanged with ∩, 
0 with S, and ⊂ with ⊃. The dual in this sense of each exercise should 
be studied along with the exercise itself. For example, the dual of 
Exercise 7 is: A ⊃ B, if and only if A = A ∪ B. Note that the first 
parts of Exercises 1 through 6 are dual to the second parts. 

It may be remarked that, if Exercises 1-6 are taken as axioms and 
7 as a definition, Exercises 8-21 and also the duality principle follow 
formally from them. For example, 10 can be proved thus: By 7, if 
A ∩ B is A, then A ⊂ B; but, by 1, A ∩ A is A; therefore A ⊂ A. 
Again, 8 can be proved, using 6, 3, 2, 1, 3, and 6 in that order, thus: 


(1)    0 ∩ A = (A ∩ ~A) ∩ A = (~A ∩ A) ∩ A 
             = ~A ∩ (A ∩ A) = ~A ∩ A = A ∩ ~A = 0. 


Such formal demonstration is fun and helps develop mathematical skill. 
In the present exercises the novice, however, should consider it as a 
possible supplement to, but not as a substitute for, demonstration by 
interpretation. 

If the exercises fail to render the notations familiar, it would be best 
to talk with someone to whom they are already familiar or, failing that, 
to read in any elementary book where the subject is treated, for example, 
Chapter II, “The Boole-Schroeder Algebra,” in the text of 
Lewis and Langford [L7]. 


Exercises illustrating Boolean algebra 


1. A ∩ A = A = A ∪ A. 
2. (A ∩ B) ∩ C = A ∩ (B ∩ C); (A ∪ B) ∪ C = A ∪ (B ∪ C). 
   (These facts often render parentheses superfluous.) 
3. A ∩ B = B ∩ A; A ∪ B = B ∪ A. 
4. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C); A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C). 
5. S ∩ A = A; 0 ∪ A = A. 
6. A ∩ (~A) = 0; A ∪ (~A) = S. 
7. A ⊂ B, if and only if A = A ∩ B. 
8. 0 ∩ A = 0. 
9. A = B, if and only if A ⊂ B and B ⊂ A. 
10. A ⊂ A. 
11. (A ∩ B) ⊂ A. 
12. If A ⊂ B, then (A ∩ C) ⊂ (B ∩ C), and (A ∪ C) ⊂ (B ∪ C). 
13. (A ∪ B) ⊂ C, if and only if A ⊂ C and B ⊂ C. 
14. 0 ⊂ A ⊂ S. 
15. A ∩ (A ∪ B) = A. 
16. ~(~A) = A. 
17. ~(A ∪ B) = (~A) ∩ (~B) (De Morgan’s theorem). 
18. ~0 = S. 
19. A ∩ (~A ∪ B) = A ∩ B. 
20. A ⊂ B, if and only if (~B) ⊂ (~A). 
21. A ⊂ B, if and only if A ∩ (~B) = 0. 
22. ~(∪_i A_i) = ∩_i (~A_i) (General De Morgan’s theorem). 
23. A ∪ (∩_i B_i) = ∩_i (A ∪ B_i). 
24. A ∩ (∩_i B_i) = ∩_i (A ∩ B_i). 
25. (∪_i A_i) ∪ (∪_j B_j) = ∪_{i,j} (A_i ∪ B_j). 
26. (∩_i A_i) ∪ (∩_j B_j) = ∩_{i,j} (A_i ∪ B_j). 
27. A ⊂ (∩_i B_i), if and only if A ⊂ B_i for every i. 
28. (∩_i B_i) ⊂ B_j ⊂ (∪_i B_i) for every j. 
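
As a supplement to demonstration by interpretation, a few of the exercises can be checked mechanically over every choice of events in a small world. The sketch below is illustrative only: the three-state universal event, the helper names `subsets`, `comp`, and `holds_for_all`, and the selection of exercises are assumptions made here. Since containment in the text is not strict (Exercise 10 asserts that A is contained in A), it is rendered by Python's `<=` on sets.

```python
from itertools import combinations, product

S = frozenset(range(3))   # a small universal event, assumed for illustration
EMPTY = frozenset()       # the vacuous event, written 0 in the text

def subsets(universe):
    """All subsets of the universe, as frozensets."""
    items = list(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def comp(a):
    """Complement with respect to S, written ~A in the text."""
    return S - a

EVENTS = subsets(S)

def holds_for_all(statement, arity):
    """Check a statement about `arity` events for every choice of events."""
    return all(statement(*args) for args in product(EVENTS, repeat=arity))

checks = {
    "4": holds_for_all(lambda a, b, c: a & (b | c) == (a & b) | (a & c), 3),
    "7": holds_for_all(lambda a, b: (a <= b) == (a == a & b), 2),
    "7 (dual)": holds_for_all(lambda a, b: (a >= b) == (a == a | b), 2),
    "8": holds_for_all(lambda a: EMPTY & a == EMPTY, 1),
    "17": holds_for_all(lambda a, b: comp(a | b) == comp(a) & comp(b), 2),
    "19": holds_for_all(lambda a, b: a & (comp(a) | b) == a & b, 2),
    "21": holds_for_all(lambda a, b: (a <= b) == (a & comp(b) == EMPTY), 2),
}

for name, ok in checks.items():
    print(name, ok)
```

A check of this kind over one finite world is evidence, not proof, that a statement is universally valid; it is best read alongside the formal and interpretive demonstrations discussed above.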


5 Consequences, acts, and decisions 


To say that a decision is to be made is to say that one of two or more 
acts is to be chosen, or decided on. In deciding on an act, account 
must be taken of the possible states of the world, and also of the con- 
sequences implicit in each act for each possible state of the world. A 
consequence is anything that may happen to the person. 

Consider an example. Your wife has just broken five good eggs into 
a bowl when you come in and volunteer to finish making the omelet. 
A sixth egg, which for some reason must either be used for the omelet 
or wasted altogether, lies unbroken beside the bowl. You must de- 
cide what to do with this unbroken egg. Perhaps it is not too great an 
oversimplification to say that you must decide among three acts only, 
namely, to break it into the bowl containing the other five, to break it 
into a saucer for inspection, or to throw it away without inspection. 
Depending on the state of the egg, each of these three acts will have 
some consequence of concern to you, say that indicated by Table 1. 


TABLE 1. AN EXAMPLE ILLUSTRATING ACTS, STATES, AND CONSEQUENCES

                  |                             State
Act               | Good                          | Rotten
------------------+-------------------------------+------------------------------
break into bowl   | six-egg omelet                | no omelet, and five good eggs
                  |                               | destroyed
break into saucer | six-egg omelet, and a saucer  | five-egg omelet, and a saucer
                  | to wash                       | to wash
throw away        | five-egg omelet, and one good | five-egg omelet
                  | egg destroyed                 |


Even the little example concerning the omelet suggests how varied 
the things, or experiences, regarded as consequences, can be. They 
might in general involve money, life, state of health, approval of friends, 
well-being of others, the will of God, or anything at all about which the 
person could possibly be concerned. Consequences might appropriately 
be called states of the person, as opposed to states of the world. They 
might also be referred to, with some extension of the economic notion 
of income, as the possible incomes of the person. In any one problem, 
the set of consequences envisaged will be denoted by F, and the individual 
consequences will be denoted by f, g, h, etc. In the omelet example, 
F consists of the six consequences tabulated in Table 1: six-egg 
omelet; no omelet, and five good eggs destroyed; etc. 

If two different acts had the same consequences in every state of the 
world, there would from the present point of view be no point in considering 
them two different acts at all. An act may therefore be identified 
with its possible consequences. Or, more formally, an act is a 
function attaching a consequence to each state of the world. The notation 
f will be used to denote an act, that is, a function, attaching the 
consequence f(s) to the state s. The notation f is logically a better 
name for a function than the more customary f(s) for exactly the same 
reason that the word “logarithm” is a better term for logarithm than 
“logarithm of x” would be. The notational distinction involved here is 
often justifiably neglected in mathematical work, but we will have spe- 
cial need to observe it, at least in connection with acts, as will soon be 
explained. When several acts are to be discussed at once, they may be 
denoted by different letters thus: f, g, h; by the use of primes thus: f, 
f′, f″; or by subscripts thus: f_i, f_j. The set of all acts available in a 
given situation will be denoted by F or a similar symbol. In the ex- 
ample of the omelet, F has three acts as elements. If, for example, f 
denotes the first of the three acts listed in Table 1, then f is defined 
thus: 
(1)    f(good) = six-egg omelet; 
       f(rotten) = no omelet, and five good eggs destroyed. 
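
The identification of an act with a function from states to consequences can be made concrete in a few lines of Python. This is a minimal sketch: representing each act as a dictionary, and the names `STATES`, `ACTS`, and `CONSEQUENCES`, are illustrative assumptions, while the entries themselves are those of the omelet table.

```python
STATES = ["good", "rotten"]

# Each act attaches a consequence to each state of the world.
ACTS = {
    "break into bowl": {
        "good": "six-egg omelet",
        "rotten": "no omelet, and five good eggs destroyed",
    },
    "break into saucer": {
        "good": "six-egg omelet, and a saucer to wash",
        "rotten": "five-egg omelet, and a saucer to wash",
    },
    "throw away": {
        "good": "five-egg omelet, and one good egg destroyed",
        "rotten": "five-egg omelet",
    },
}

# The act f is the whole function; f["good"] is a single consequence, f(good).
f = ACTS["break into bowl"]
print(f["good"])    # six-egg omelet
print(f["rotten"])  # no omelet, and five good eggs destroyed

# The set of consequences envisaged in the problem.
CONSEQUENCES = {act[s] for act in ACTS.values() for s in STATES}
print(len(CONSEQUENCES))  # 6
```

Keeping `f` (the dictionary) distinct from `f["good"]` (one of its values) mirrors the notational distinction between an act and the consequence it attaches to a particular state.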


The argument might be raised that the formal description of decision 
that has thus been erected seems inadequate because a person may not 
know the consequences of the acts open to him in each state of the 
world. He might be so ignorant, for example, as not to be sure whether 
one rotten egg will spoil a six-egg omelet. But in that case nothing 
could be simpler than to admit that there are four states in the world 
corresponding to the two states of the egg and the two conceivable 
answers to the culinary question whether one bad egg will spoil a six- 
egg omelet. It seems to me obvious that this solution works in the 
greatest generality, though a thoroughgoing analysis might not be triv- 
ial. A reader interested in the technicalities of this point or that of 
the succeeding paragraph will find an extensive discussion of a similar 
problem in Chapter II of [V4], where von Neumann and Morgenstern 
discuss the reduction of a general game to its reduced form. 
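
The device of enlarging the world so that ignorance about consequences becomes ignorance about states can be sketched as follows. The encoding of a state as a pair, the function name `break_into_bowl`, and in particular the consequence assigned when a rotten egg does not spoil the omelet are hypothetical choices made here for illustration; the text does not specify that consequence.

```python
from itertools import product

EGG = ["good", "rotten"]
SPOILS = [True, False]  # the two conceivable answers to the culinary question

# Enlarged world: a state settles both the egg and the culinary question.
STATES = list(product(EGG, SPOILS))
print(len(STATES))  # 4

def break_into_bowl(state):
    """The act 'break into bowl' as a function on the enlarged world."""
    egg, spoils = state
    if egg == "good":
        return "six-egg omelet"
    if spoils:
        return "no omelet, and five good eggs destroyed"
    # Hypothetical consequence, not given in the text.
    return "six-egg omelet containing one rotten egg"

for state in STATES:
    print(state, "->", break_into_bowl(state))
```

In the enlarged world the act is again a definite function of the state, so the person's uncertainty about what the act will bring is entirely uncertainty about which state obtains.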

Again, the formal description might seem inadequate in that it does 
not provide explicitly for the possibility that one decision may lead to 
another. Thus, if the omelet should be spoiled by breaking a rotten 
egg into it, new questions might arise about what to substitute for 
breakfast and how to appease your justifiably furious wife. But, just 
as in the preceding paragraph an apparent shortcoming of the proposed 
mode of description was attributed to an incomplete analysis of the 
possible states, here I would say that the list of available acts envisaged 
in Table 1 is inadequate for the interpretation that has just been put 
on the problem. Where the single act “break into bowl” now stands, 
there should be several, such as: “break into bowl, and in case of disaster 
have toast,” “break into bowl, and in case of disaster take family 
to a neighboring restaurant for breakfast.” Appropriate consequences 
of these new acts can easily be imagined. 

As has just been suggested, what in the ordinary way of thinking 
might be regarded as a chain of decisions, one leading to the other in 
time, is in the formal description proposed here regarded as a single de- 
cision. To put it a little differently, it is proposed that the choice of a 
policy or plan be regarded as a single decision. This point of view, 
though not always in so explicit a form, has played a prominent role 
in the statistical advances of the present century. For example, the 
great majority of experimentalists, even today, suppose that the func- 
tion of statistics and of statisticians is to decide what conclusions to 
draw from data gathered in an experiment or other observational pro- 
gram. But statisticians hold it to be lacking in foresight to gather data 
without a view to the method of analysis to be employed, that is, they 
hold that the design and analysis of an experiment should be decided 
upon as an articulated whole. 

The point of view under discussion may be symbolized by the proverb, 
“Look before you leap,” and the one to which it is opposed by the 
proverb, “You can cross that bridge when you come to it.” When two 
proverbs conflict in this way, it is proverbially true that there is some 
truth in both of them, but rarely, if ever, can their common truth be 
captured by a single pat proverb. One must indeed look before he 
leaps, in so far as the looking is not unreasonably time-consuming and 
otherwise expensive; but there are innumerable bridges one cannot 
afford to cross, unless he happens to come to them. 

Carried to its logical extreme, the “Look before you leap” principle 
demands that one envisage every conceivable policy for the government 
of his whole life (at least from now on) in its most minute details, in 
the light of the vast number of unknown states of the world, and decide 
here and now on one policy. This is utterly ridiculous, not—as some 
might think—because there might later be cause for regret, if things 
did not turn out as had been anticipated, but because the task implied 
in making such a decision is not even remotely resembled by human 
possibility. It is even utterly beyond our power to plan a picnic or to 
play a game of chess in accordance with the principle, even when the 
world of states and the set of available acts to be envisaged are artifi- 
cially reduced to the narrowest reasonable limits. 

Though the “Look before you leap” principle is preposterous if carried 
to extremes, I would none the less argue that it is the proper subject 
of our further discussion, because to cross one’s bridges when one 
comes to them means to attack relatively simple problems of decision 
by artificially confining attention to so small a world that the “Look 
before you leap” principle can be applied there. I am unable to formu- 
late criteria for selecting these small worlds and indeed believe that 
their selection may be a matter of judgment and experience about which 
it is impossible to enunciate complete and sharply defined general prin- 
ciples, though something more will be said in this connection in § 5.5. 
On the other hand, it is an operation in which we all necessarily have 
much experience, and one in which there is in practice considerable 
agreement. 

In view of the “Look before you leap” principle, acts and decisions, 
like events, are timeless. The person decides “now” once for all; there 
is nothing for him to wait for, because his one decision provides for all 
contingencies. None the less, temporal modes of description, though 
translatable into atemporal ones, are often suggestive. Thus, there 
will be occasion to analyze and make frequent use of the idea of defer- 
ring a decision until an observation relevant to it has been made. 


6 The simple ordering of acts with respect to preference 


Of two acts f and g, it is possible that the person prefers f to g. 
Loosely speaking, this means that, if he were required to decide between 
f and g, no other acts being available, he would decide on f. 

This procedure for testing preference is not entirely adequate, if only 
because it fails to take account of, or even define, the possibility that 
the person may not really have any preference between f and g, re- 
garding them as equivalent; in which case his choice of f should not be 
regarded as significant. If the person really does regard f and g as 
equivalent, that is, if he is indifferent between them, then, if f or g 
were modified by attaching an arbitrarily small bonus to its conse- 
quences in every state, the person’s decision would presumably be for 
whichever act was thus modified. This test for indifference does not 
provide an altogether satisfactory definition, since it begs the question 
to some extent by postulating in effect that the tester knows what con- 
stitutes a small bonus. Another attempted solution would be to say 
that the person knows by introspection whether he has decided hap- 
hazardly or in response to a definite feeling of preference. This sort of 
solution seems to me especially objectionable, because I think it of 
great importance that preference, and indifference, between f and g be 
determined, at least in principle, by decisions between acts and not by 
response to introspective questions. In spite of the difficulty of dis- 
tinguishing between preference and indifference, I think enough has 
been said for us to proceed to a postulational treatment of them. 

The very meaning of the relationship of preference that I have attempted 
to establish in the preceding paragraph implies that the person 
cannot simultaneously prefer f to g and g to f. In the postulational 
treatment of the relationships of preference and indifference, it will be 
technically convenient to work with the relation “is not preferred to” 
rather than directly with its complementary relation “is preferred to.” 
Thus, rather than say that it is impossible that both f is preferred to 
g and g to f, I might say that, of any two acts f and g, f is not preferred 
to g or g is not preferred to f, possibly both. Again, the definition of 
preference suggests that, if f is not preferred to g, and g is not preferred 
to h, then it is impossible that f should be preferred to h. 

The two assumptions just made about the relation “is not preferred
to” are sometimes expressed in ordinary mathematical usage by saying
that the relation is a simple ordering among acts. Formally, a relation
≤· among a set of elements x, y, z, ···, is called a simple ordering, in
this book, if and only if for every x, y, and z:


1. Either x ≤· y, or y ≤· x.
2. If x ≤· y, and y ≤· z, then x ≤· z.


Borrowing from arithmetic the suggestive abbreviation ≤ for the
relation “is not preferred to,” the assumption that ≤ is a simple
ordering can be expressed formally by a postulate, thus:


P1 The relation ≤ is a simple ordering among acts.
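
The two parts of the definition of simple ordering can be checked mechanically on a finite set. The following Python sketch is an illustration only: the function name is_simple_ordering, the sample acts, and the numerical scores are hypothetical and are not part of the theory. It tests one relation that satisfies the definition and one, built from a cycle of preferences, that does not.

```python
from itertools import product

def is_simple_ordering(elements, leq):
    """Return True if leq satisfies both parts of the definition on elements."""
    # Part 1: for every x and y, either x <= y or y <= x (x may equal y).
    complete = all(leq(x, y) or leq(y, x) for x, y in product(elements, repeat=2))
    # Part 2: if x <= y and y <= z, then x <= z.
    transitive = all(
        leq(x, z)
        for x, y, z in product(elements, repeat=3)
        if leq(x, y) and leq(y, z)
    )
    return complete and transitive

# Hypothetical acts ranked by a made-up score; "is not preferred to" is <= on scores.
score = {"f": 1, "g": 2, "h": 2}
acts = list(score)
by_score = lambda a, b: score[a] <= score[b]

# A cycle: f is not preferred to g, g not to h, h not to f, and nothing else
# except that each act is not preferred to itself.
cycle = {("f", "g"), ("g", "h"), ("h", "f"), ("f", "f"), ("g", "g"), ("h", "h")}
by_cycle = lambda a, b: (a, b) in cycle

print(is_simple_ordering(acts, by_score))  # True
print(is_simple_ordering(acts, by_cycle))  # False: part 2 fails
```

The cyclic relation passes part 1 but fails part 2, which is the formal counterpart of the uncomfortable triple of preferences discussed later in this section.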


It is noteworthy that P1 makes no explicit reference to states of the 
world. Except possibly for mathematical refinements,† it seems to me
that no additional postulates can be formulated without making such 
reference—at any rate none will be in this book. 

P1 by itself is not very rich in consequences, but one easily proved 
theorem following from it may be mentioned. 


THEOREM 1 If F is a finite set of acts, there exist f and h in F such
that for all g in F

    f ≤ g ≤ h.


Theorem 1 is especially relevant to application of the theory of de- 
cision, because I interpret the theory to imply that, if F is finite, the 
person will decide on an act h in F to which no other act in F is pre- 
ferred, the existence of at least one such h being guaranteed by the 
theorem. 
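
The induction behind Theorem 1 amounts to a single pass through the finite set. The sketch below is a hypothetical illustration: the helper extremes and the scores are assumptions introduced here, and the relation leq is presumed to be a simple ordering.

```python
def extremes(acts, leq):
    """Return (f, h) with leq(f, g) and leq(g, h) for every g in acts.

    acts must be a non-empty list and leq a simple ordering on it.
    """
    low = high = acts[0]
    for g in acts[1:]:
        if leq(g, low):    # g is not preferred to the current least act
            low = g
        if leq(high, g):   # the current greatest act is not preferred to g
            high = g
    return low, high

# Hypothetical scores standing in for a person's preferences.
score = {"g": 2, "f": 1, "h": 3, "k": 2}
leq = lambda a, b: score[a] <= score[b]

f, h = extremes(list(score), leq)
print(f, h)
```

Transitivity is what keeps the running least and greatest acts valid as each new act is examined; without part 2 of the definition the loop could return an act to which some other act is preferred.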

It is often appropriate to consider infinite sets of available acts. In 
economic contexts, for example, it is generally an inappropriate com- 
plication to take explicit account of the possibility that all transactions 
must be in integral numbers of pennies. If infinite sets of available acts 
are set up and interpreted without some mathematical tact, unrealistic 
conclusions are likely to follow. Suppose, for example, that you were 
free to choose any income, provided it be definitely less than $100,000 
per year. Precisely which income would you choose, abstracting from 
the indivisibility of pennies? 

† For example, such topological assumptions about the space with neighborhoods
defined in terms of < as connectedness, local compactness, or density.

It is sometimes convenient to supplement the relation ≤ by other
relations derived from it in accordance with the definitions in Table 1,
analogous definitions being applicable to any simple ordering. The
assumption of simple ordering, P1, has several implications for the
derived relations ≥, <, >, and ≐. These are generally strongly
suggested by the properties of the corresponding relations in arithmetic.


TABLE 1. TABLE OF RELATIONS DERIVED FROM ≤


New Relation                                 Definition

f ≥ g.                                       g ≤ f.
f < g, i.e., g is preferred to f.            It is false that g ≤ f.
f > g.                                       g < f.
f ≐ g, i.e., f is equivalent to (or          f ≤ g, and g ≤ f.
  indifferent with respect to) g.
g is between f and h.                        f ≤ g ≤ h or h ≤ g ≤ f.


A few such implications of P1 are listed below, with no intention of 
completeness, as exercises for those who may not already be familiar 
with the elementary properties of simple ordering. 


Exercises 


1. The relation ≥ is also a simple ordering.

2. All the relations ≤, ≥, <, >, and ≐ are transitive, that is, they
can be validly substituted for ≤· in the second part of the definition of
simple ordering.

3. Between any pair of acts f, g, one and only one of the three
relations <, ≐, and > holds.

4. If f < g, and g ≐ h, then f < h.

5. If f ≐ g, then g ≐ f.

6. For any f, f ≐ f.

7. At least one of three acts f, g, h is between the other two. When 
can there be more than one such? 
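
Some of these exercises can be spot-checked by brute force on a small example. The Python sketch below is a hedged illustration, not a proof: the helper derived, the four sample acts, and their scores are hypothetical. It builds the relations of Table 1 from a given “is not preferred to” relation and tests Exercises 2 and 3.

```python
from itertools import product

def derived(leq):
    """Build the derived relations of Table 1 from 'is not preferred to'."""
    geq = lambda f, g: leq(g, f)
    lt = lambda f, g: not leq(g, f)          # g is preferred to f
    gt = lambda f, g: lt(g, f)
    equiv = lambda f, g: leq(f, g) and leq(g, f)
    return geq, lt, gt, equiv

# Hypothetical acts and scores; g and h are equivalent.
score = {"f": 1, "g": 2, "h": 2, "k": 3}
acts = list(score)
leq = lambda a, b: score[a] <= score[b]
geq, lt, gt, equiv = derived(leq)

def transitive(rel):
    """Second part of the definition of simple ordering, applied to rel."""
    return all(
        rel(a, c)
        for a, b, c in product(acts, repeat=3)
        if rel(a, b) and rel(b, c)
    )

# Exercise 2: all five relations are transitive.
all_transitive = all(transitive(r) for r in (leq, geq, lt, gt, equiv))

# Exercise 3: exactly one of <, equivalence, > holds for each pair.
exactly_one = all(
    [lt(a, b), equiv(a, b), gt(a, b)].count(True) == 1
    for a, b in product(acts, repeat=2)
)

print(all_transitive, exactly_one)
```

A check on one example cannot replace the general argument, which rests only on the two parts of the definition of simple ordering.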


Two very different sorts of interpretations can be made of P1 and
the other postulates to be adduced later. First, P1 can be regarded as
a prediction about the behavior of people, or animals, in decision
situations. Second, it can be regarded as a logic-like criterion of
consistency in decision situations. For us, the second interpretation is the
only one of direct relevance, but it may be fruitful to discuss both,
calling the first empirical and the second normative.


Logic itself admits an empirical as well as a normative interpreta- 
tion. Thus, if an experimental subject believes certain propositions, 
it is to be expected that he will also believe their logical consequences 
and disbelieve the negations of these consequences. This theory of hu- 
man psychology has some validity and is of great practical utility in our 
everyday dealings with other people, though it is very crude and ap- 
proximate. For one thing, people often do make elementary mistakes 
in logic; more refined theories would attribute these mistakes to such 
things as accident or subconscious motivation. For another, if any- 
one who believed the axioms of mathematics also believed all that they 
imply and nothing that they contradict, mathematical study would be 
superfluous for him; such a person would, as has been explained, be 
able to state the ten-thousandth or any other term in the decimal
expansion of π on demand. To summarize, logic can be interpreted as a
crude but sometimes handy empirical psychological theory. 

The principal value of logic, however, is in connection with its norma- 
tive interpretation, that is, as a set of criteria by which to detect, with 
sufficient trouble, any inconsistencies there may be among our beliefs, 
and to derive from the beliefs we already hold such new ones as con- 
sistency demands. It does not seem appropriate here to attempt an 
analysis of why and in what contexts we wish to be consistent; it is 
sufficient to allude to the fact that we often do wish to be so. 

Analogously, P1 together with the postulates to be adduced later can 
be interpreted as a crude and shallow empirical theory predicting the 
behavior of people making decisions. This theory is practical in suitably 
limited domains, and everyone in fact makes use of at least some as- 
pects of it in predicting the behavior of others. At the same time, the 
behavior of people is often at variance with the theory. The departure 
is sometimes flagrant, in which case our attitude toward it is much like 
that we hold toward a slip in logic, calling the departure a mistake and 
attributing it to such things as accident and subconscious motivation. 
Or, the departure may be detectable only by a long chain of argument 
or calculation, the possibilities becoming increasingly complicated as 
new postulates are brought to stand beside P1.

Pursuing the analogy with logic, the main use I would make of P1
and its successors is normative, to police my own decisions for consist- 
ency and, where possible, to make complicated decisions depend on 
simpler ones. 

Here it is more pertinent than it was in connection with logic that 
something be said of why and when consistency is a desideratum, though 
I cannot say much. Suppose someone says to me, “I am a rational 
person, that is to say, I seldom, if ever, make mistakes in logic. But I
behave in flagrant disagreement with your postulates, because they
violate my personal taste, and it seems to me more sensible to cater to my
taste than to a theory arbitrarily concocted by you.” I don’t see how 
I could really controvert him, but I would be inclined to match his in- 
trospection with some of my own. I would, in particular, tell him that, 
when it is explicitly brought to my attention that I have shown a pref- 
erence for f as compared with g, for g as compared with h, and for h as 
compared with f, I feel uncomfortable in much the same way that I do 
when it is brought to my attention that some of my beliefs are logically 
contradictory. Whenever I examine such a triple of preferences on my 
own part, I find that it is not at all difficult to reverse one of them. In 
fact, I find on contemplating the three alleged preferences side by side 
that at least one among them is not a preference at all, at any rate not 
any more. 

There is some temptation to explore the possibilities of analyzing 
preference among acts as a partial ordering, that is, in effect to replace 
part 1 of the definition of simple ordering by the very weak proposition
f ≤ f, admitting that some pairs of acts are incomparable. This would
seem to give expression to introspective sensations of indecision or vacil- 
lation, which we may be reluctant to identify with indifference. My 
own conjecture is that it would prove a blind alley losing much in power 
and advancing little, if at all, in realism; but only an enthusiastic ex- 
ploration could shed real light on the question. 


7 The sure-thing principle 

A businessman contemplates buying a certain piece of property. He 
considers the outcome of the next presidential election relevant to the 
attractiveness of the purchase. So, to clarify the matter for himself, 
he asks whether he would buy if he knew that the Republican candidate 
were going to win, and decides that he would do so. Similarly, he con- 
siders whether he would buy if he knew that the Democratic candidate 
were going to win, and again finds that he would do so. Seeing that he 
would buy in either event, he decides that he should buy, even though 
he does not know which event obtains, or will obtain, as we would ordi- 
narily say. It is all too seldom that a decision can be arrived at on the 
basis of the principle used by this businessman, but, except possibly 
for the assumption of simple ordering, I know of no other extralogical 
principle governing decisions that finds such ready acceptance. 

Having suggested what I shall tentatively call the sure-thing prin- 
ciple, let me give it relatively formal statement thus: If the person 
would not prefer f to g, either knowing that the event B obtained, or 
knowing that the event ~B obtained, then he does not prefer f to g.
Moreover (provided he does not regard B as virtually impossible) if he
would definitely prefer g to f, knowing that B obtained, and, if he would 
not prefer f to g, knowing that B did not obtain, then he definitely pre- 
fers g to f. 

The sure-thing principle cannot appropriately be accepted as a postu- 
late in the sense that P1 is, because it would introduce new undefined 
technical terms referring to knowledge and possibility that would ren- 
der it mathematically useless without still more postulates governing 
these terms. It will be preferable to regard the principle as a loose one 
that suggests certain formal postulates well articulated with P1. 

What technical interpretation can be attached to the idea that f 
would be preferred to g, if B were known to obtain? Under any rea- 
sonable interpretation, the matter would seem not to depend on the 
values f and g assume at states outside of B. There is, then, no loss 
of generality in supposing that f and g agree with each other except in 
B, that is, that f(s) = g(s) for all s ∈ ~B. Under this unrestrictive
assumption, f and g are surely to be regarded as equivalent given ~B;
that is, they would be considered equivalent, if it were known that B 
did not obtain. The first part of the sure-thing principle can now be 
interpreted thus: If, after being modified so as to agree with one an- 
other outside of B, f is not preferred to g; then f would not be preferred 
to g, if B were known. The notion will be expressed formally by
saying that f ≤ g given B.+

It is implicit in the argument that has just led to the definition of
f ≤ g given B that, if two acts f and g are so modified in ~B as to agree
with each other, then the order of preference obtaining between the
modified acts will not depend on which of the permitted modifications
was actually carried out. Equivalently, if f and g are two acts that do
agree with each other in ~B, and f ≤ g; then, if f and g are modified
in ~B in any way such that the modified acts f′ and g′ continue to
agree with each other in ~B, it will also be so that f′ ≤ g′. This
assumption is made formally in the postulate P2 below and illustrated
schematically in Figure 1, a kind of diagram I find suggestive in many 
such contexts. 

In Figure 1, the set S of all states s and the set F of all consequences
f are represented by horizontal and vertical intervals respectively. In
any such diagram an act f, being a function attaching a value f(s) ∈ F
to each s ∈ S, is represented by a graph. This particular diagram graphs
two acts f and g that agree with each other in ~B, and two other acts 
f’ and g’ that also agree with each other in ~B and arise by modifying 
f and g respectively only in ~B, that is, acts agreeing with f and g 
respectively in B. 

+ In this edition, the corresponding definition D1 on the end papers has
been slightly strengthened to compensate an inadvertent weakness in the end
paper version of P2, pointed out to me by Peter Fishburn.


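Figure 1. [Diagram not reproduced: graphs of the acts f, g, f′, and g′,
with the states S on the horizontal axis and the consequences F on the
vertical axis. Labeled values: g(s_1) = g′(s_1) and f(s_1) = f′(s_1) at a
state s_1 in B; f′(s_2) = g′(s_2) and f(s_2) = g(s_2) at a state s_2 in ~B.]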


P2 If f, g, and f′, g′ are such that:
1. in ~B, f agrees with g, and f′ agrees with g′,
2. in B, f agrees with f′, and g agrees with g′,
3. f ≤ g;

then f′ ≤ g′.
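
A small numerical model may make P2 and the definition of “f ≤ g given B” concrete. The Python sketch below rests on assumptions that are not part of the text at this point: preference is taken to be represented by a weighted sum of numerical consequences, and the weights, the four states, and the names value, leq, leq_given, and p2_holds are all hypothetical. The model is used only to supply one relation satisfying P1 and P2; it is not claimed to be the only such relation.

```python
from itertools import product

WEIGHT = (1, 2, 3, 4)            # hypothetical weights of states 0..3
STATES = range(len(WEIGHT))
CONSEQUENCES = (0, 1, 2)         # consequences are numbers in this toy model

def value(act):
    """Weighted sum of the consequences of an act (a tuple indexed by state)."""
    return sum(WEIGHT[s] * act[s] for s in STATES)

def leq(f, g):
    """f is not preferred to g."""
    return value(f) <= value(g)

def leq_given(f, g, B):
    """f <= g given B: compare f and g after making them agree outside B.

    The common value 0 used outside B is arbitrary; P2 says the choice
    cannot matter.
    """
    f_mod = tuple(f[s] if s in B else 0 for s in STATES)
    g_mod = tuple(g[s] if s in B else 0 for s in STATES)
    return leq(f_mod, g_mod)

def p2_holds(B):
    """Brute-force check of P2 for the event B over all acts in the model."""
    outside = [s for s in STATES if s not in B]
    all_acts = list(product(CONSEQUENCES, repeat=len(WEIGHT)))
    for f, g in product(all_acts, repeat=2):
        if any(f[s] != g[s] for s in outside):
            continue                      # condition 1 fails for f and g
        for common in product(CONSEQUENCES, repeat=len(outside)):
            patch = dict(zip(outside, common))
            f2 = tuple(patch.get(s, f[s]) for s in STATES)   # agrees with f in B
            g2 = tuple(patch.get(s, g[s]) for s in STATES)   # agrees with g in B
            if leq(f, g) and not leq(f2, g2):
                return False
    return True

f, g = (2, 2, 0, 0), (0, 0, 2, 2)
print(p2_holds({0, 1}))             # True
print(leq(f, g))                    # True: f is not preferred to g outright
print(leq_given(f, g, {0, 1}))      # False: given B = {0, 1}, f is preferred
```

The last two lines show that “f ≤ g” and “f ≤ g given B” are different statements: an act may be no better overall and yet better once attention is confined to B.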


Each of the relations “≤ given B” is now easily seen to be a simple
ordering, and the relations “≥, <, >, ≐ given B” are to be defined
mutatis mutandis. It is noteworthy though obvious that, if f(s) = g(s)
for all s ∈ B, then f ≐ g given B.

It is now possible and instructive to give an atemporal analysis of 
the following temporally described decision situation: The person must 
decide between f and g after he finds out, that is, observes, whether B 
obtains; what will his decision be if he finds out that B does in fact 
obtain? 

Atemporally, the person can submit himself to the consequences of 
f or else of g for all s ∈ B, and, independently, he can submit himself to
the consequences of f or else of g for all s ∈ ~B; which alternative will
he decide upon for the s’s in B? 

Finally, describing the situation not only atemporally but also quite 
formally, the person must decide among four acts defined thus: 


h_00 agrees with f on B and with f on ~B,
h_01 agrees with f on B and with g on ~B,
h_10 agrees with g on B and with f on ~B,
h_11 agrees with g on B and with g on ~B.


The question at issue now takes this form. Supposing that none of
the four functions is preferred to the particular one h_ij, is i = 0, or is
i = 1; that is, does h_ij agree with f on B or with g on B?

It is not hard to see that i can be 1, if and only if f ≤ g given B.
Indeed, if i = 1, h_0j ≤ h_1j, which means that f ≤ g given B. Arguing in
the opposite direction, if f ≤ g given B; then h_00 ≤ h_10, and h_01 ≤ h_11.
Suppose now, for definiteness, h_10 ≤ h_11, then none of the four
possibilities is preferred to h_11; this proves the point in question.
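
The four acts h_00, h_01, h_10, h_11 can be written out explicitly in a toy model. As before, the weighted-sum representation of preference, the weights, and the particular acts f and g in this Python sketch are hypothetical assumptions chosen for illustration, and splice is a helper name introduced here.

```python
WEIGHT = (1, 2, 3, 4)            # hypothetical weights of states 0..3
STATES = range(len(WEIGHT))
B = {0, 1}                       # the event to be observed

def value(act):
    return sum(WEIGHT[s] * act[s] for s in STATES)

def splice(on_B, off_B):
    """Act agreeing with on_B on B and with off_B on ~B."""
    return tuple(on_B[s] if s in B else off_B[s] for s in STATES)

f = (5, 0, 1, 1)
g = (0, 4, 0, 3)
pair = (f, g)

# h[(i, j)] agrees with pair[i] on B and with pair[j] on ~B.
h = {(i, j): splice(pair[i], pair[j]) for i in (0, 1) for j in (0, 1)}

# The act to which none of the four is preferred (unique in this example).
i, j = max(h, key=lambda ij: value(h[ij]))

# f <= g given B, tested on acts modified to agree with f outside B.
f_leq_g_given_B = value(splice(f, f)) <= value(splice(g, f))

print((i, j), f_leq_g_given_B)
```

Here the chosen act has i = 1, that is, it agrees with g on B, and correspondingly f ≤ g given B holds in the model.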

It may fairly be said that the person considers B virtually impossible, 
or that B is null; if and only if, for all f and g, f ≤ g given B. Indeed,
if B is null in this sense, the values acts take on elements of B are irrele- 
vant to all decisions. 

Several trivial conclusions about null events are listed as a compound 
theorem, all components but the last of which have immediate intuitive 
interpretations. 


THEOREM 1

1. The vacuous event, 0, is null.
2. B is null, if and only if, for every f and g, f ≐ g given B.
3. If B is null, and B ⊃ C; then C is null.
4. If ~B is null; f ≤ g given B, if and only if f ≤ g.
5. f ≤ g given S, if and only if f ≤ g.
6. If S is null, f ≐ g for every f and g.


Component 6 of Theorem 1 requires comment, because it corresponds 
to a pathological situation. In case S is null, it is not really intuitive 
to say that S (and therefore every event) is virtually impossible. The 
interpretation is rather that the person simply doesn’t care what
happens to him. This is imaginable, especially under a suitably restricted
interpretation of F, but it is uninteresting and will accordingly be ruled
out by a later postulate, P5.

A finite set of events B_i is a partition of B; if B_i ∩ B_j = 0, for i ≠ j,
and ⋃_i B_i = B. With this definition, it is easily proved by arithmetic
induction that


THEOREM 2 If B_i is a partition of B, and f ≤ g given B_i for each i,
then f ≤ g given B. If, in addition, f < g given B_j for at least one j,
then f < g given B.


COROLLARY 1 The union of any finite number of null events is null.


There are still other interesting consequences of Theorem 2, which 
may be most conveniently mentioned informally. If, in Theorem 2, 
B = S (or, more generally, if ~B is null), it is superfluous to say “given
B” in the conclusions of the theorem. If f ≐ g given B_i for each i,
then f ≐ g given B. So much for the consequences of P2.
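
Null events, Theorem 2, and Corollary 1 can likewise be examined in a toy model. The Python sketch below again assumes, purely for illustration, that preference is represented by a weighted sum of numerical consequences; two states are given weight zero so that null events exist. The names leq_given, lt_given, is_null, and theorem2_holds are hypothetical.

```python
from itertools import product

WEIGHT = (0, 0, 3, 4)            # hypothetical weights; states 0 and 1 carry none
STATES = range(len(WEIGHT))
ACTS = list(product((0, 1, 2), repeat=len(WEIGHT)))

def leq_given(f, g, B):
    # In this additive model, making f and g agree outside B leaves only
    # the states in B to decide the comparison.
    return sum(WEIGHT[s] * f[s] for s in B) <= sum(WEIGHT[s] * g[s] for s in B)

def lt_given(f, g, B):
    """f < g given B, i.e., it is false that g <= f given B."""
    return not leq_given(g, f, B)

def is_null(B):
    """B is null if f <= g given B for all acts f and g."""
    return all(leq_given(f, g, B) for f in ACTS for g in ACTS)

def theorem2_holds(cells):
    """Check both parts of Theorem 2 for a partition given as a list of sets."""
    B = set().union(*cells)
    for f, g in product(ACTS, repeat=2):
        if all(leq_given(f, g, C) for C in cells):
            if not leq_given(f, g, B):
                return False
            if any(lt_given(f, g, C) for C in cells) and not lt_given(f, g, B):
                return False
    return True

print(is_null(set()), is_null({0}), is_null({1}), is_null({0, 1}))
print(is_null({2}))
print(theorem2_holds([{0}, {1, 2}, {3}]))
```

In this model an event is null exactly when its total weight is zero, so the union of the null events {0} and {1} is again null, as Corollary 1 requires.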

Acts that are constant, that is, acts whose consequences are inde- 
pendent of the state of the world, are of special interest. In particular, 
they lead to a natural definition of preference among consequences in 
terms of preference among acts. Following ordinary mathematical
usage, f ≡ g will mean that f is identically g, that is, for every s, f(s) = g.
A formal definition of preference among consequences can now
conveniently be expressed thus. For any consequences g and g′, g ≤ g′;
if and only if, when f ≡ g and f′ ≡ g′, f ≤ f′.

In the same spirit, meaning can be assigned to such expressions as
f ≤ g, g ≤ f given B, etc., and I will freely use such expressions without
defining them explicitly. In particular, f ≤ g given B has a natural
meaning, but one that is rendered superfluous by the next postulate, 
P3. 

Incidentally, it is now evident how awkward for us it would be to 
use f(s) for f; because f(s) ≤ g(s) is a statement about the consequences
f(s) and g(s), whereas f ≤ g is a statement about acts, and we will
have frequent need for both sorts of statements. 

Suppose that f ≡ g, and f′ ≡ g′, and that g ≤ g′; is it reasonable to
admit that, for some B, f > f′ given B? That depends largely on the
interpretation we choose to make of our technical terms, as an example 
helps to bring out.+ 

Before going on a picnic with friends, a person decides to buy a 
bathing suit or a tennis racket, not having at the moment enough money 
for both. If we call possession of the tennis racket and possession of 
the bathing suit consequences, then we must say that the consequences 
of his decision will be independent of where the picnic is actually held. 
If the person prefers the bathing suit, this decision would presumably 
be reversed, if he learned that the picnic were not going to be held 
near water. Thus the question whether it can happen that f > f’ 
given B would be answered in the affirmative. But, under the
interpretation of “act” and “consequence” I am trying to formulate, this is
not the correct analysis of the situation. The possession of the tennis 
racket and the possession of the bathing suit are to be regarded as acts, 
not consequences. (It would be equivalent and more in accordance 
with ordinary discourse to say that the coming into possession, or the 
buying, of them are acts.) The consequences relevant to the decision 
are such as these: a refreshing swim with friends, sitting on a shadeless 
beach twiddling a brand-new tennis racket while one’s friends swim, 
etc. It seems clear that, if this analysis is carried to its limit, the
question at issue must be answered in the negative; and I therefore propose
to assume the negative answer as a postulate. The postulate is so
couched as not only to assert that knowledge of an event cannot
establish a new preference among consequences or reverse an old one, but
also to assert that, if the event is not null, no preference among
consequences can be reduced to indifference by knowledge of an event.

+ The role of such freedom throughout science is brilliantly discussed by
Quine (1951).


P3 If f ≡ g, f′ ≡ g′, and B is not null; then f ≤ f′ given B, if and
only if g ≤ g′.


Applying Theorem 2, it is obvious that 


THEOREM 3 If B_i is a partition of B; and if (for all i and s) f_i ≤ g_i,
f(s) = f_i, and g(s) = g_i when s ∈ B_i; then f ≤ g given B. If, in
addition, f_j < g_j for some j for which B_j is not null, then f < g given B.


Theorem 3 is logically equivalent to P3 in the presence of P1 and P2,
and Theorem 3 can as easily be given an intuitive basis as the postulate 
P3. Therefore the assumption of P3 as a postulate instead of Theorem 
3 is only a matter of taste. 
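
The role of the proviso “for which B_j is not null” in Theorem 3 shows up clearly in a small example. The Python sketch below is an illustration under stated assumptions: preference is taken to be a weighted sum of utilities, and the weights, the partition CELLS, the utility table, and the function names are all hypothetical.

```python
WEIGHT = (1, 0, 3, 4)               # hypothetical weights; state 1 carries none
CELLS = [{0}, {1}, {2, 3}]          # a partition of S; the cell {1} is null
UTILITY = {"bad": 0, "fair": 1, "good": 2}   # made-up utilities of consequences

def act_from_cells(consequences):
    """Act taking the consequence consequences[i] on every state of CELLS[i]."""
    act = [None] * len(WEIGHT)
    for cell, c in zip(CELLS, consequences):
        for s in cell:
            act[s] = c
    return tuple(act)

def value(act):
    return sum(w * UTILITY[c] for w, c in zip(WEIGHT, act))

def compare(f_cells, g_cells):
    """Return '<', 'equivalent', or '>' for the acts built from the two lists."""
    vf = value(act_from_cells(f_cells))
    vg = value(act_from_cells(g_cells))
    if vf < vg:
        return "<"
    if vf > vg:
        return ">"
    return "equivalent"

# g is better on the non-null cell {0} and no worse elsewhere: f < g.
print(compare(["bad", "bad", "good"], ["fair", "bad", "good"]))
# g is better only on the null cell {1}: the acts are equivalent.
print(compare(["bad", "bad", "good"], ["bad", "fair", "good"]))
```

Two people with different weights but the same ordering of the consequences bad, fair, good would agree on the first comparison so long as neither regards the cell {0} as null, which is the kind of agreement among different people mentioned in the next paragraph.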

Theorem 3 has been widely accepted by the British-American School 
of statisticians, special emphasis having been given to it, in connection 
with his notion of admissibility, by the late Abraham Wald. I believe, 
as will be more fully explained later, that much of its particular sig- 
nificance for that school stems from the implication that, if several 
different people agree in their preferences among consequences, then 
they must also agree in their preferences among certain acts. 

This brings the present chapter to a natural conclusion, since the 
further postulates to be proposed can be more conveniently introduced 
in connection with the uses to which they are put in later chapters. 


CHAPTER 3 


Personal Probability 


1 Introduction 


I personally consider it more probable that a Republican president 
will be elected in 1996 than that it will snow in Chicago sometime in the 
month of May, 1994. But even this late spring snow seems to me more 
probable than that Adolf Hitler is still alive. Many, after careful con- 
sideration, are convinced that such statements about probability to a 
person mean precisely nothing, or at any rate that they mean nothing 
precisely. At the opposite extreme, others hold the meaning to be so 
self-evident as to be unanalyzable. An intermediate position is taken
in this chapter, where a particular interpretation of probability to a 
person is given in terms of the theory of consistent decision in the face 
of uncertainty, the exposition of which was begun in the last chapter. 
Much as I hope that the notion of probability defined here is consistent 
with ordinary usage, it should be judged by the contribution it makes 
to the theory of decision, not by the accuracy with which it analyzes 
ordinary usage. 

Perhaps the first way that suggests itself to find out which of two 
events a person considers more probable is simply to ask him. It might 
even be argued, though I think fallaciously, that, since the question 
concerns what is inside the person’s head, there can be no other method, 
just as we have little, if any, access to a person’s dreams except through 
his verbal report. Attempts to define the relative probability of a pair 
of events in terms of the answers people give to direct interrogation 
have justifiably met with antipathy from most statistical theorists. In
the first place, many doubt that the concept “more probable to me
than” is an intuitive one, open to no ambiguity and yet admitting no 
further analysis. Even if the concept were so completely intuitive, 
which might justify direct interrogation as a subject worthy of some 
psychological study, what could such interrogation have to do with the 
behavior of a person in the face of uncertainty, except of course for his 
verbal behavior under interrogation? If the state of mind in question 
is not capable of manifesting itself in some sort of extraverbal behavior,
it is extraneous to our main interest. If, on the other hand, it does
manifest itself through more material behavior, that should, at least 
in principle, imply the possibility of testing whether a person holds 
one event to be more probable than another, by some behavior express- 
ing, and giving meaning to, his judgment. It would, in short, be pref- 
erable, at least in principle, to interrogate the person, not literally 
through his verbal answer to verbal questions, but rather in a figurative 
sense somewhat reminiscent of that in which a scientific experiment is 
sometimes spoken of as an interrogation of nature. Several schemes of 
behavioral, as opposed to direct, interrogation have been proposed. 
The one introduced below was suggested to me by a passage of de Fi- 
netti’s (on pp. 5-6 of [D2]), though the passage itself does not empha- 
size behavioral interrogation. 

To illustrate the scheme, our idealized person has just taken two 
eggs from his icebox and holds them unbroken in his hand. We wonder 
whether he thinks it more probable that the brown one is good than 
that the white one is. Our curiosity being real, we are prepared to 
pay, if necessary, to have it satisfied. We therefore address him thus: 
“We see that you are about to open those eggs. If you will be so
cooperative as to guess that one or the other egg is good, we will pay you
a dollar, should your guess prove correct. If incorrect, you and we
are quits, except that we will in any event exchange your two eggs for
two of guaranteed goodness.” If under these circumstances the person
stakes his chance for the dollar on the brown egg, it seems to me to 
correspond well with ordinary usage to say that it is more probable to 
him that the brown one is good than that the white one is. Though, 
of course, I hope for your agreement on this analysis of ordinary usage, 
I repeat that it is not really fundamental to the subsequent argument, 
as indeed no such lexicographical point could be; for the utility of a 
construct or definition depends only secondarily on the aptness of the 
expression in terms of which it is couched. 

There is a mode of interrogation intermediate between what I have 
called the behavioral and the direct. One can, namely, ask the person, 
not how he feels, but what he would do in such and such a situation. 
In so far as the theory of decision under development is regarded as 
an empirical one, the intermediate mode is a compromise between econ- 
omy and rigor. But, in the theory’s more important normative inter- 
pretation as a set of criteria of consistency for us to apply to our own 
decisions, the intermediate mode seems to me to be just the right 
one. 

Though it entails digression from the main theme, some readers may 
be interested in a few words about actual experimentation on strictly
empirical behavioral interrogation. Some key references bearing on
the subject are [M4], [R3], and [W8]. 

In the first place, a little reflection shows that an experiment in which 
human subjects are required to decide among actual acts may be very 
expensive in time, money, and effort, especially if the consequences en- 
visaged are expensive to provide, a point discussed in detail in [W8]. 
Questions of morality, and even of legality, toward the subject may 
further complicate the investigation. For example, Mosteller and No- 
gee, as described in Section 3B of [M4], made certain that every sub- 
ject in one experiment of theirs would be financially benefited, though 
they kept this security secret from the subjects. 

There is also a difficulty in principle. Suppose that I wish to dis- 
cover a person’s preferences among several acts—three acts f, g, and h 
are sufficient to bring out the difficulty. If I in good faith offer him the 
opportunity to decide among all three, and he decides on f; then there 
is no further possibility of discovering what his preference was between 
g and h. Suppose, for example, that a hot man actually prefers a swim,
a shower, and a glass of beer, in that order. Once he decides on, and 
thereby becomes entitled to, the swim, he can no longer appropriately 
be asked to decide between shower and beer. A naive attempt to do so 
would result in his deciding between a swim and shower on the one 
hand, and a swim and beer on the other—an altogether different situa- 
tion from the one intended. 

The difficulty can sometimes be met by special devices. For example, 
the investigator might wait for a different but “similar” occasion. But 
W. Allen Wallis has mentioned to me an interesting and very general 
device, which will now be described, with his permission.†

Suppose that the hot man is instructed to rank the three acts in 
order, subject to the consideration that two of them will be drawn at 
random (e.g., by card drawing or dice rolling), and that he is then to 
have whichever of these two acts he has assigned the lower rank. He 
is thus called on to select one of six acts, that is, one of the six possible 
rankings. If he does, for example, select the ranking {swim, shower, 
beer}, it follows easily from the theory of decision thus far developed 
that for him swim > shower > beer, barring the farfetched possibility 
that he regards one or more of the three drawings as virtually
impossible and provided that his preference among the three acts swim, shower,
beer given any of the three drawings is the same as his original
preference. The investigator could in practice design the drawing in such a
way as to be well satisfied that the required “irrelevance” obtained,
except for very “superstitious” people. This ends the present digression on
actual behavioral interrogation.

† I have since seen this same device used by M. Allais.
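
The ranking device can be simulated in a few lines. The Python sketch below is only an illustration and goes beyond what the argument requires: it assigns hypothetical numerical utilities to the three acts and treats the three possible drawings as equally likely, whereas the text needs only that no drawing be regarded as virtually impossible. The function name payoff is likewise an assumption.

```python
from itertools import combinations, permutations

# Hypothetical utilities for the hot man: swim > shower > beer.
UTILITY = {"swim": 3, "shower": 2, "beer": 1}

def payoff(ranking):
    """Total utility over the three possible drawings of two acts.

    From each drawn pair the person receives the act he ranked better,
    that is, the one appearing earlier in his ranking.
    """
    total = 0
    for a, b in combinations(UTILITY, 2):
        received = a if ranking.index(a) < ranking.index(b) else b
        total += UTILITY[received]
    return total

rankings = list(permutations(UTILITY))   # the six available acts
best = max(rankings, key=payoff)

for r in rankings:
    print(r, payoff(r))
print("chosen:", best)
```

Every ranking other than the true one loses on at least one drawing, so reporting the true ranking is the unique best of the six acts in this model.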

The purpose of this chapter is to explore the concept of personal 
probability† that was indicated in the example about the two eggs.
The concept will be put on a formal basis in § 2 by introducing two new 
postulates, P4 and P5, to be used in conjunction with P1-3. This will 
lead to a formal analysis of the notion that one event is no more prob- 
able than another. Several deductions about this notion reminiscent 
of mathematical properties ordinarily attributed to probability will be 
made; but only in § 3, after adjunction of still another postulate, P6, 
can the notion be connected quantitatively with what mathematicians 
ordinarily call mathematical probability. Section 4 is devoted to some 
mathematically technical criticisms of the notion of personal proba- 
bility, which can safely be skipped or skimmed by those not interested 
in such matters. Section 5 discusses conditional personal probability; 
6, the approach to certainty through a long sequence of conditionally 
independent relevant observations; and 7, an extension of the concept 
of a sequence of independent events, particularly interesting from the 
viewpoint of personal probability. 


2 Qualitative personal probability 


When I spoke in the introductory section of offering the person a 
dollar if his guess about the egg proved correct, it was tacitly assumed 
that his guess would not be affected by the amount of the prize offered. 
That seems to me correct in principle. It would, for example, seem un- 
reasonable for the person with the two eggs to reverse his decision if 
the prize were reduced from a dollar to a penny. He might reverse 
himself in going from a penny to a dollar, because he might not have 
found it worth his trouble to give careful consideration for too small a 
prize. I think the anomaly can best be met by deliberately pretending 
that consideration costs the person nothing, though that is far from the 
truth in actual complicated situations. It might, on the other hand, 
be stimulating, and it is certainly more realistic, to think of considera- 
tion or calculation as itself an act on which the person must decide. 
Though I have not explored the latter possibility carefully, I suspect 
that any attempt to do so formally leads to fruitless and endless re- 
gression. 

† The term “personal probability” was suggested to me orally by Thornton C.
Fry. Some other terms suggested for the same concept are “subjective probability,”
“psychological probability,” and “degree of conviction.”

To offer a prize in case A obtains means to make available to the
person an act $f_A$ such that

$$f_A(s) = f \ \text{for } s \in A, \qquad f_A(s) = f' \ \text{for } s \in {\sim}A, \tag{1}$$


where f’ < f. The assumption that on which of two events the person 
will choose to stake a given prize does not depend on the prize itself 
is expressed by the following postulate, which looks formidable only 
because it contains four definitions like (1). The reader may find it 
helpful to graph an instance of the postulate in the spirit of Figure 
2.7.1. 


P4 If $f, f', g, g'$; $A, B$; $f_A, f_B, g_A, g_B$ are such that:

1. $f' < f$, $g' < g$;
2a. $f_A(s) = f$, $g_A(s) = g$ for $s \in A$,
    $f_A(s) = f'$, $g_A(s) = g'$ for $s \in {\sim}A$;
2b. $f_B(s) = f$, $g_B(s) = g$ for $s \in B$,
    $f_B(s) = f'$, $g_B(s) = g'$ for $s \in {\sim}B$;
3. $f_A \le f_B$;
then $g_A \le g_B$.


In the light of P4, it will be said that A is not more probable than
B, abbreviated $A \le B$; if and only if when $f' < f$ and $f_A$, $f_B$
are such that

$$f_A(s) = f \ \text{for } s \in A, \qquad f_A(s) = f' \ \text{for } s \in {\sim}A,$$

$$f_B(s) = f \ \text{for } s \in B, \qquad f_B(s) = f' \ \text{for } s \in {\sim}B;$$

then $f_A \le f_B$.
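
A minimal sketch in Python may make the content of P4 and of the definition just given concrete. It rests on assumptions that are not part of the text at this point: the person's ranking of acts is imitated by a weighted average of numerical payoffs, and the weights in `WEIGHT`, like the names `prize_act`, `value`, and `not_more_probable`, are hypothetical. Under those assumptions, the verdict on whether A is not more probable than B comes out the same for every prize pair, which is what P4 demands.

```python
from fractions import Fraction

# Hypothetical three-state world with assumed weights (not from the text).
WEIGHT = {"s1": Fraction(1, 2), "s2": Fraction(1, 3), "s3": Fraction(1, 6)}


def prize_act(event, prize, consolation):
    """Act paying `prize` on the states of `event` and `consolation` elsewhere."""
    return {s: (prize if s in event else consolation) for s in WEIGHT}


def value(act):
    """Assumed stand-in for the person's ranking of acts: weighted average payoff."""
    return sum(WEIGHT[s] * act[s] for s in WEIGHT)


def not_more_probable(a, b, prize, consolation):
    """Test 'A is not more probable than B' with one prize pair, consolation < prize."""
    if not consolation < prize:
        raise ValueError("the consolation must be worth less than the prize")
    return value(prize_act(a, prize, consolation)) <= value(prize_act(b, prize, consolation))


A, B = {"s2"}, {"s1"}
# Three different prize pairs: a dollar, a penny, and a gain against a loss.
verdicts = {not_more_probable(A, B, f, f0) for f, f0 in [(100, 0), (1, 0), (5, -5)]}
print(verdicts)  # a single verdict means the size of the prize did not matter
```

A ranking of acts that violated P4 would show up here as a set `verdicts` containing both truth values.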

The assumption that there is at least one worth-while prize is in- 
nocuous; for, though a context failing to satisfy it might arise, such a 
context would be too trivial to merit study. I therefore propose the 
following postulate. 


P5 There is at least one pair of consequences f, f’ such that f’ < f. 


All the implications to be deduced from P1-5 for some time to come
are themselves implications of the three easily established conclusions,
which are introduced by the following definition and theorem.

A relation $\mathrel{\le\cdot}$ between events is a qualitative
probability; if and only if, for all events B, C, D,

1. $\mathrel{\le\cdot}$ is a simple ordering,

2. $B \mathrel{\le\cdot} C$, if and only if
$B \cup D \mathrel{\le\cdot} C \cup D$, provided
$B \cap D = C \cap D = 0$,

3. $0 \mathrel{\le\cdot} B$, $0 \mathrel{<\cdot} S$.


It may be helpful to remark that the second part of the above
definition says, in effect, that it will not affect the person’s guess
to offer him a consolation prize in case neither B nor C obtains, but
D happens to.
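
The three conditions can be checked mechanically on a small finite S. The following Python sketch does so for a relation induced by numerical weights; the weights `P`, the three-state set `S`, and the helper names are assumptions made only for illustration, and "simple ordering" is read here as a relation under which any two events are comparable and which is transitive.

```python
from fractions import Fraction
from itertools import combinations

S = frozenset({"a", "b", "c"})
# Assumed weights, chosen only for illustration.
P = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}


def events(universe):
    """All subsets of a finite universe."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def leq(b, c):
    """Relation induced by the weights: B is not more probable than C."""
    return sum(P[s] for s in b) <= sum(P[s] for s in c)


def is_qualitative_probability(rel, universe):
    ev = events(universe)
    empty = frozenset()
    # Condition 1: simple ordering (comparability and transitivity).
    complete = all(rel(b, c) or rel(c, b) for b in ev for c in ev)
    transitive = all(
        rel(b, d) for b in ev for c in ev for d in ev if rel(b, c) and rel(c, d)
    )
    # Condition 2: adjoining a disjoint D to both sides changes nothing.
    additive = all(
        rel(b, c) == rel(b | d, c | d)
        for b in ev for c in ev for d in ev
        if not (b & d) and not (c & d)
    )
    # Condition 3: 0 is not more probable than any B, and 0 is less probable than S.
    bounds = all(rel(empty, b) for b in ev) and not rel(universe, empty)
    return complete and transitive and additive and bounds


print(is_qualitative_probability(leq, S))
```

A relation that declares every event not more probable than every other passes the first two conditions but fails the third, since it makes S no more probable than the vacuous event.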


THEOREM 1 The relation $\le$ as applied to events is a qualitative
probability.


You will have no difficulty in proving that Theorem 1 follows from
P1-5. Theorem 1 has many consequences of the sort one would expect
if $\le$ meant “not more probable than” in any sense having the
mathematical properties ordinarily attributed to numerical probability.
This is illustrated by the following list of exercises, which should
not only be proved formally, but also interpreted intuitively. One easy
exercise not included in the list below, because it is not strictly a
consequence of Theorem 1 alone, is to show that $B \doteq 0$, if and
only if B is a null event.


Exercises 


1. If $B \subset C$, then $0 \le B \le C \le S$.

2a. If $B \cap D = C \cap D = 0$; then $B < C$, if and only if
$B \cup D < C \cup D$.

2b. If $0 < C$, and $B \cap C = 0$; then $B < B \cup C$.

3. If $B \le C$, then ${\sim}C \le {\sim}B$; and conversely. Hint: Draw
a Venn diagram of the fourfold partition $B \cap C$, ${\sim}B \cap C$,
$B \cap {\sim}C$, ${\sim}B \cap {\sim}C$.

4a. If $B \le C$, and $C \cap D = 0$; then $B \cup D \le C \cup D$.

4b. If $B \le 0$; then $B \cup C \doteq C$, and $B \doteq 0$.

4c. If $S \le B$; then $B \cap C \doteq C$, and $B \doteq S$.

4d. If $B \cup D \le C \cup D$, and $B \cap D = 0$; then $B \le C$.

5a. If $B_1 \le C_1$, $B_2 \le C_2$, and $C_1 \cap C_2 = 0$; then
$B_1 \cup B_2 \le C_1 \cup C_2$. Hint: Exhibit $B_2$ and $C_1$ in the
form $B_2 = B_2' \cup Q$, $C_1 = C_1' \cup Q$ with $B_2'$, $C_1'$, $Q$
disjoint. Justify the following calculation, step by step.

$$B_1 \cup B_2' \le C_1 \cup B_2' = C_1' \cup B_2 \le C_1' \cup C_2,$$

whence $B_1 \cup B_2 \le C_1 \cup C_2$.

5b. If $B_1 \cup B_2 < C_1 \cup C_2$, and $B_1 \cap B_2 = 0$; then
$B_1 < C_1$ or $B_2 < C_2$.

6. If $B \le {\sim}B$ and $C \ge {\sim}C$, then $B \le C$; equality
holding in the conclusion, if and only if it holds in both parts of the
hypothesis.


3 Quantitative personal probability 


As I have said, the exercises terminating the preceding section sug- 
gest a close mathematical parallelism between personal probability and 
the mathematical properties ordinarily attributed to probability, though 
the postulates assumed thus far do not (as could easily be demonstrated) 
make it possible to deduce from this parallelism the unambiguous as- 
signment of a numerical probability to each event. But, if, for example 
(following de Finetti [D2]), a new postulate asserting that S can be 
partitioned into an arbitrarily large number of equivalent subsets were 
assumed, it is pretty clear (and de Finetti explicitly shows in [D2]) 
that numerical probabilities could be so assigned. It might fairly be 
objected that such a postulate would be flagrantly ad hoc. On the 
other hand, such a postulate could be made relatively acceptable by 
observing that it will obtain if, for example, in all the world there is a 
coin that the person is firmly convinced is fair, that is, a coin such that 
any finite sequence of heads and tails is for him no more probable than 
any other sequence of the same length; though such a coin is, to be sure, 
a considerable idealization. 

After some general and abstract discussion of the mathematical con- 
nection between qualitative and quantitative probability, a postulate, 
P6, will be proposed, which, though logically actually stronger than the 
assumption that there are partitions of S into equivalent events, seems 
to me even easier to accept. Once P6 is accepted, there will scarcely 
again be any need to refer directly to qualitative probability. 

To begin with, let me say precisely what is meant, in the present 
context, by a probability measure, this being the standard term for 
what I would here otherwise prefer to call a quantitative probability, 
and what it means for a probability measure to be in agreement with 
a qualitative probability. 

A probability measure on a set S is a function P(B) attaching to
each $B \subset S$ a real number such that:

1. $P(B) \ge 0$ for every B.

2. If $B \cap C = 0$, $P(B \cup C) = P(B) + P(C)$.

3. $P(S) = 1$.

This definition, or something very like it, is at the root of all ordinary
mathematical work in probability.

If S carries a probability measure P and a qualitative probability
$\mathrel{\le\cdot}$ such that, for every B, C, $P(B) \le P(C)$, if and
only if $B \mathrel{\le\cdot} C$; then P (strictly) agrees with
$\mathrel{\le\cdot}$. If $B \mathrel{\le\cdot} C$ implies
$P(B) \le P(C)$, then P almost agrees with $\mathrel{\le\cdot}$. This
terminology is obviously consistent in that, if P agrees, that is,
strictly agrees, with $\mathrel{\le\cdot}$, P also almost agrees with
$\mathrel{\le\cdot}$. It is also easily seen that, if P agrees with
$\mathrel{\le\cdot}$, then knowledge of P implies knowledge of
$\mathrel{\le\cdot}$. But, if P only almost agrees with
$\mathrel{\le\cdot}$, it may happen, as examples in § 4 show, that
$P(B) = P(C)$, though $B \mathrel{<\cdot} C$, so that knowledge of P may
imply only imperfect knowledge of $\mathrel{\le\cdot}$.
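
The difference between strict and almost agreement can be seen in a toy case, sketched here in Python. The example is an assumption of this illustration, not one of the examples of § 4: on a three-state S, events are compared first by the weights `P1` and, in case of a tie, by the tie-breaking weights `P2`. The measure built from `P1` then almost agrees with the relation but does not strictly agree with it, because the event {3} has measure zero and is nonetheless more probable than the vacuous event.

```python
from fractions import Fraction
from itertools import combinations, product

S = frozenset({1, 2, 3})
# Assumed two-level weights: state 3 matters only as a tie-breaker.
P1 = {1: Fraction(1, 2), 2: Fraction(1, 2), 3: Fraction(0)}
P2 = {1: Fraction(0), 2: Fraction(0), 3: Fraction(1)}
EVENTS = [frozenset(c) for r in range(4) for c in combinations(sorted(S), r)]


def P(b):
    """The candidate probability measure."""
    return sum((P1[s] for s in b), Fraction(0))


def leq(b, c):
    """Qualitative relation: compare by P1, break ties by P2 (tuples compare lexicographically)."""
    key_b = (P(b), sum((P2[s] for s in b), Fraction(0)))
    key_c = (P(c), sum((P2[s] for s in c), Fraction(0)))
    return key_b <= key_c


def almost_agrees():
    """B not more probable than C implies P(B) <= P(C)."""
    return all(P(b) <= P(c) for b, c in product(EVENTS, repeat=2) if leq(b, c))


def strictly_agrees():
    """P(B) <= P(C) if and only if B is not more probable than C."""
    return all((P(b) <= P(c)) == leq(b, c) for b, c in product(EVENTS, repeat=2))


print(almost_agrees(), strictly_agrees())
```

Knowing P alone, one could not tell that the person would rather stake a prize on {3} than on the vacuous event, which is the imperfect knowledge referred to above.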

The rest of this section is mainly a study of qualitative probabilities 
generally, with a view to discovering interesting conditions under which 
there is a probability measure that agrees, either strictly or almost, 
with a given qualitative probability. These conditions suggest a new 
postulate governing the special qualitative probability $\le$. The work 
is necessarily rather tedious and burdened with detail. It will, there- 
fore, be wise for most readers to skim over the material, omitting the 
proofs but noticing the more obvious logical connections among the 
theorems and definitions. Some may then find themselves sufficiently 
interested in the details to return and read or supply the proofs, as the 
case may require. Others may safely go forward. Here, as elsewhere, 
technical terms of interest for the moment only are introduced with 
italics rather than boldface. 

An n-fold almost uniform partition of B is an n-fold partition of B 
such that the union of no r elements of the partition is more probable 
than that of any r + 1 elements. 


THEOREM 1 If there exist n-fold almost uniform partitions of B for 
arbitrarily large values of n, then there exist m-fold almost uniform par- 
titions for every positive integer m. 


PROOF. Let $B_i$, $i = 1, \ldots, n$, be an n-fold almost uniform
partition (of B) with $n \ge m^2$. Using the euclidean algorithm, let n
be written $n = am + b$, where a and b are integers such that $m \le a$
and $0 \le b < m$. Now let $C_j$, $j = 1, \ldots, m$, be any m-fold
partition such that each $C_j$ is the union of a or a + 1 of the
$B_i$’s. The union of any r of the $C_j$’s, $r < m$, is the union of
from $ar$ to $(a + 1)r$ of the $B_i$’s and the union of r + 1 of the
$C_j$’s is that of from $a(r + 1)$ to $(a + 1)(r + 1)$ of the $B_i$’s.
Since $r < m \le a$, $(a + 1)r = ar + r < ar + a = a(r + 1)$. ◆
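
The counting in this proof can be checked numerically. The Python sketch below only verifies the arithmetic, under the assumption that the blocks are formed as in the proof; the function names are hypothetical. It groups n items into m blocks of size a or a + 1 and confirms that even the r largest blocks contain fewer items than the r + 1 smallest.

```python
def group_sizes(n, m):
    """Sizes of m blocks covering n items, each block of size a or a + 1."""
    a, b = divmod(n, m)  # n = a*m + b with 0 <= b < m
    return [a + 1] * b + [a] * (m - b)


def check_counting_step(n, m):
    """Verify (a + 1) r < a (r + 1) and the block counts it is applied to."""
    if n < m * m:
        raise ValueError("the proof assumes n >= m**2")
    a, _ = divmod(n, m)
    sizes = sorted(group_sizes(n, m))
    ok = a >= m and sum(sizes) == n
    for r in range(1, m):
        most_in_r = sum(sizes[-r:])            # items in the r largest blocks
        fewest_in_next = sum(sizes[: r + 1])   # items in the r + 1 smallest blocks
        ok = ok and most_in_r <= (a + 1) * r < a * (r + 1) <= fewest_in_next
    return ok


print(group_sizes(17, 4), check_counting_step(17, 4))
```

Because a union of fewer elements of an almost uniform partition is never more probable than a union of more of them, this strict count is what makes the coarser partition almost uniform.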


THEOREM 2 If there exist n-fold almost uniform partitions of S for
arbitrarily large values of n, then there is one and only one probability
measure P that almost agrees with $\mathrel{\le\cdot}$. Furthermore, for
any p, $0 \le p \le 1$, any $B \subset S$, and the unique P just defined,
there exists $C \subset B$ such that $P(C) = pP(B)$.†


PROOF. The proof is broken into a sequence of easy steps, left, for
the most part, to the reader. These steps are grouped in blocks, only
the last step in each being needed in the proof of later steps.

1. There exist n-fold almost uniform partitions of S for every
positive n.

2a. If $p_1, \ldots, p_n$ are real numbers such that
$0 \le p_1 \le p_2 \le \cdots \le p_n$, and $\sum p_i = 1$; then

$$\sum_{i=1}^{r} p_i \le r/n, \qquad r = 1, \ldots, n. \tag{1}$$

2b. If further

$$\sum_{i=1}^{r+1} p_i \ge \sum_{i=n-r+1}^{n} p_i \qquad \text{for } r = 1, \ldots, n - 1;$$

then

$$\sum_{i=1}^{r} p_i \ge (r - 1)/n, \quad \text{and} \quad \sum_{i=n-r+1}^{n} p_i \le (r + 1)/n. \tag{2}$$

2c. The sum of any r of the $p_i$’s lies between $(r - 1)/n$ and
$(r + 1)/n$.

2d. If P almost agrees with $\mathrel{\le\cdot}$, and $C(r, n)$ denotes
here and later in this proof any union of r elements of any n-fold
almost uniform partition (not necessarily the same from one context to
another), then

$$(r - 1)/n \le P(C(r, n)) \le (r + 1)/n. \tag{3}$$

3. Let $k(B, n)$ denote the largest integer r (possibly zero) such that
some $C(r, n)$ is not more probable than B. The function $k(B, n)$ is
well-defined, and $0 \le k(B, n) \le n$.

4a. For any P that almost agrees with $\mathrel{\le\cdot}$,

$$(k(B, n) - 1)/n \le P(B) \le (k(B, n) + 2)/n. \tag{4}$$

4b. At most one P can almost agree with $\mathrel{\le\cdot}$.

5a. If $B_i$ and $C_i$ are n-fold partitions (not necessarily almost
uniform) so indexed that
$B_1 \mathrel{\le\cdot} B_2 \mathrel{\le\cdot} \cdots \mathrel{\le\cdot} B_n$,
and
$C_1 \mathrel{\ge\cdot} C_2 \mathrel{\ge\cdot} \cdots \mathrel{\ge\cdot} C_n$;
then

$$\bigcup_{i=n-r}^{n} B_i \mathrel{\ge\cdot} \bigcup_{i=n-r}^{n} C_i, \qquad r = 0, \ldots, n - 1. \tag{5}$$

5b. If in addition the two partitions are almost uniform, then

$$\bigcup_{i=1}^{r} C_i \mathrel{\le\cdot} \bigcup_{i=1}^{r+2} B_i, \qquad r = 1, \ldots, n - 2. \tag{6}$$

(Proof.
$\bigcup_{1}^{r+2} B_i \mathrel{\ge\cdot} \bigcup_{n-r}^{n} B_i \mathrel{\ge\cdot} \bigcup_{n-r}^{n} C_i \mathrel{\ge\cdot} \bigcup_{1}^{r} C_i$.)

5c. The union of any r elements of one almost uniform n-fold partition
is not more probable than the union of any r + 2 elements of another.

5d. If $B \cap C = 0$, then

$$k(B, n) + k(C, n) - 2 \le k(B \cup C, n) \le k(B, n) + k(C, n) + 1. \tag{7}$$

6a. If a $C(r, m)$ is not more probable than a $C(s, n)$, then

$$\frac{r - 2}{m} \le \frac{s + 2}{n} + \frac{1}{mn}. \tag{8}$$

(Consider an mn-fold almost uniform partition, and use the easily
established fact that the union of any t + 2 elements of an almost
uniform partition is actually more probable than that of any t
elements.)

6b.
$$\left| \frac{k(B, m)}{m} - \frac{k(B, n)}{n} \right| \le \frac{3}{m} + \frac{3}{n} + \frac{1}{mn}.$$

6c. It is meaningful to define P(B) by

$$P(B) =_{Df} \lim_{n \to \infty} \frac{k(B, n)}{n}, \tag{9}$$

that is, the limit exists.

7. P(B), as just defined, is a probability measure, and the only one
that almost agrees with $\mathrel{\le\cdot}$.

8a. There exist two infinite sequences of sets $C_n$ and $D_n$
contained in B such that:

    1. $C_n \cap D_n = 0$,
    2. $C_n \subset C_{n+1}$, and $D_n \subset D_{n+1}$,
    3. $P(C_n) \ge pP(B) - n^{-1}$,
    4. $P(D_n) \ge (1 - p)P(B) - n^{-1}$.

8b. $P(\bigcup_n C_n) \ge pP(B)$, $P(\bigcup_n D_n) \ge (1 - p)P(B)$,
and $(\bigcup_n C_n) \cap (\bigcup_n D_n) = 0$.

8c. $P(\bigcup_n C_n) = pP(B)$. ◆


† Technical note: The mathematical essence of the terminal conclusion of this
theorem, and other conclusions related to it, are given by Sobczyk and Hammer
[S15]. It might be conjectured, in analogy with countably additive measures, that
this conclusion means only that P is non-atomic, but that conjecture is false [N5].+

+ A key reference for further information on the structure of finitely
additive measures is (Dubins 1969). Sustained use of finitely additive
probability is illustrated in (Dubins and Savage 1965).
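
The way the counts k(B, n) pin down a number can be illustrated in Python under a simplifying assumption that is not made in the proof: suppose the qualitative probability is induced by a measure and that the n-fold partitions used are exactly uniform, so that a union of r elements has probability r/n. Then k(B, n) is simply the largest r with r/n not exceeding P(B). The names `k` and `p_b` are hypothetical.

```python
from fractions import Fraction
from math import floor


def k(p_b, n):
    """Largest r with r/n <= P(B), assuming exactly uniform n-fold partitions."""
    return floor(p_b * n)


p_b = Fraction(1, 3)  # assumed probability of the event B
for n in (1, 10, 100, 1000):
    kn = k(p_b, n)
    # The bracketing of P(B) in the spirit of inequality (4).
    assert Fraction(kn - 1, n) <= p_b <= Fraction(kn + 2, n)
    print(n, kn, float(Fraction(kn, n)))
```

The printed ratios k(B, n)/n approach P(B) as n grows, which is the limit that step 6c takes as the definition of P(B) when only the qualitative relation is given.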


A few technical terms of localized interest only are now introduced.
If and only if, for every $B \mathrel{>\cdot} 0$, there is a partition
of S, no element of which is as probable as B; $\mathrel{\le\cdot}$ is
fine.+ B and C are almost equivalent, written $B \mathrel{=\cdot} C$; if
and only if for all non-null G and H such that
$B \cap G = C \cap H = 0$, $B \cup G \mathrel{\ge\cdot} C$ and
$C \cup H \mathrel{\ge\cdot} B$. It is obvious that equivalent events
are also almost equivalent. Finally, if and only if every pair of almost
equivalent events are equivalent, $\mathrel{\le\cdot}$ is tight.


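THEOREM 3
HYP. $\mathrel{\le\cdot}$ is fine.

CONCL. 1. If $B \mathrel{>\cdot} 0$, and $C \mathrel{>\cdot} 0$; there
exists $D \subset C$ such that
$0 \mathrel{<\cdot} D \mathrel{\le\cdot} B$.

2. If $B \mathrel{=\cdot} G$, $C \mathrel{=\cdot} H$, and
$B \cap C = G \cap H = 0$; then $B \cup C \mathrel{=\cdot} G \cup H$.

3. If $B \mathrel{=\cdot} C$, $G \mathrel{=\cdot} H$,
$B \cup C \mathrel{=\cdot} G \cup H$, and $B \cap C = G \cap H = 0$;
then $B \mathrel{=\cdot} G$.

4. Any partition of S into almost equivalent events is an almost
uniform partition.

5. Any event can be partitioned into two almost equivalent events.

6. Any event can be partitioned into $2^n$ almost equivalent events,
for any non-negative integer n.

7. There exists one and only one P that almost agrees with
$\mathrel{\le\cdot}$. For any B, p ($0 \le p \le 1$), and the unique P
just defined, there exists $C \subset B$ such that $P(C) = pP(B)$. If
$B \mathrel{>\cdot} 0$, $P(B) > 0$. Finally, $B \mathrel{=\cdot} C$, if
and only if $P(B) = P(C)$.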


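PROOF. The parts of the conclusion are so arranged that each is easy
to prove in the light of its predecessors, but proofs for Parts 3 and 5
are given below. It may be remarked that all parts are trivial
consequences of the last one and have therefore relatively little
importance in themselves.

Part 3. Suppose, for example, $B \cup E \mathrel{<\cdot} G$,
$B \cap E = 0$, and $E \mathrel{>\cdot} 0$; and consider two cases:

(a) If $B \cup C \mathrel{<\cdot} S$, it may be assumed without loss of
generality that $C \cap E = 0$, whence
$(B \cup C) \cup E \mathrel{\ge\cdot} G \cup H$. Therefore,
$C \mathrel{>\cdot} H$.

Let E be partitioned into two non-null events $E_1$ and $E_2$; then
(since it is absurd to suppose that the part of G outside of C is null,
which would imply $C \mathrel{\ge\cdot} G \mathrel{>\cdot} B \cup E$)
there is in G an $E'$ such that
$C \cap E' = 0 \mathrel{<\cdot} E' \mathrel{\le\cdot} E_2$. Now
$C \cup E' \mathrel{\ge\cdot} H \cup E' \mathrel{\ge\cdot} G \mathrel{>\cdot} (B \cup E_1) \cup E_2$,
whence $C \mathrel{>\cdot} B \cup E_1$, which is absurd.

(b) If $B \cup C \mathrel{=\cdot} S$, it can (setting aside the easy
special case $C \cap G \mathrel{=\cdot} 0$) be shown successively that:
$H \cup G \mathrel{=\cdot} S$;
$C \mathrel{\le\cdot} B \cup E \mathrel{<\cdot} G$, where
$E \mathrel{>\cdot} 0$ and $E \subset C \cap G$;
$(B \cap H) \cup E \mathrel{\le\cdot} (G \cap C)$;
$(C \cap H) \mathrel{\le\cdot} (G \cap B)$; and
$H \cup E \mathrel{<\cdot} G$, which establishes a contradiction.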


+ In the first edition, this definition was a trifle too weak, as pointed out by 
Malcolm Pike. 




Part 5. There exists a sequence of threefold partitions of B, say
$C_n$, $D_n$, and $G_n$, such that:

1. $C_n \cup G_n \mathrel{\ge\cdot} D_n$, and
$D_n \cup G_n \mathrel{\ge\cdot} C_n$,

2. $C_{n+1} \supset C_n$, $D_{n+1} \supset D_n$, and
$G_{n+1} \subset G_n$,

3. ${\sim}G_{n+1} \cap G_n \mathrel{\ge\cdot} G_{n+1}$; whence $G_n$
contains two disjoint events each at least as probable as $G_{n+1}$.

For any $H \mathrel{>\cdot} 0$, $G_n \mathrel{\le\cdot} H$ for
sufficiently large n, as may be seen by considering some m-fold
partition no element of which is more probable than H, and letting n be
such that $2^{n-1} \ge m$. If $G_n$ were more probable than H and
therefore more probable than each element of the partition, it would
follow that the union of all elements of the partition, namely S, is
less probable than $G_1$, which would be absurd.

The two events $B_1 = \bigcup_n C_n$,
$B_2 = (\bigcup_n D_n) \cup (\bigcap_n G_n)$ partition B in the required
fashion. ◆


COROLLARY 1 If $\mathrel{\le\cdot}$ is both fine and tight; the only
probability measure that almost agrees with $\mathrel{\le\cdot}$
strictly agrees with it, and there exist partitions of S into
arbitrarily many equivalent events.


THEOREM 4 $\mathrel{\le\cdot}$ is both fine and tight, if and only if,
for every $B \mathrel{<\cdot} C$, there exists a partition of S the
union of each element of which with B is less probable than C.


The proof of this theorem is easy. 


In the light of Theorems 3 and 4, I tentatively propose the following
postulate, P6’, governing the relation $\le$ among events, and thereby
the relation $\le$ among acts.


P6’ If B < C, there exists a partition of S the union of each ele- 
ment of which with B is less probable than C. 


It seems to me rather easier to justify the assumption of P6’, which
says in effect that $\le$ is both fine and tight, than to justify the
assumption, which was made by de Finetti [D2] and by Koopman [K9], [K10],
[K11] in closely related contexts, that there exist partitions of S into
arbitrarily many equivalent events, though logically P6’ implies that
assumption and somewhat more. Suppose, for example, that you yourself
consider $B < C$, that is, that you would definitely rather stake a
gain in your fortune on C than on B. Consider the partition of your
own world into $2^n$ events each of which corresponds to a particular
sequence of n heads and tails, thrown by yourself, with a coin of your
own choosing. It seems to me that you could easily choose such a
coin and choose n sufficiently large so that you would continue to
prefer to stake your gain on C, rather than on the union of B and any
particular sequence of n heads and tails. For you to be able to do so,
you need by no means consider every sequence of heads and tails equally
probable.
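
A rough numerical rendering of the coin argument, in Python, may help. It assumes far more than P6’ does: that B and C already carry numerical probabilities, that the flips are regarded as independent with a constant chance of heads, and that the probability of a union is at most the sum of the probabilities. The function name `flips_needed` and all the figures are hypothetical. The sketch finds the smallest n for which even the likeliest sequence of n flips, adjoined to B, leaves the union less probable than C.

```python
from fractions import Fraction


def flips_needed(p_b, p_c, heads_prob=Fraction(1, 2), max_n=10_000):
    """Smallest n with P(B) + (probability of the likeliest n-flip sequence) < P(C)."""
    if not (p_b < p_c and 0 < heads_prob < 1):
        raise ValueError("need P(B) < P(C) and a coin that can fall either way")
    likeliest = max(heads_prob, 1 - heads_prob)  # per-flip chance of the likelier face
    n = 1
    while p_b + likeliest ** n >= p_c:
        n += 1
        if n > max_n:
            raise ValueError("no suitable n found below max_n")
    return n


# A fair coin and a coin biased 7 to 3 both serve; the biased one needs more flips.
print(flips_needed(Fraction(3, 10), Fraction(7, 20)))
print(flips_needed(Fraction(3, 10), Fraction(7, 20), heads_prob=Fraction(7, 10)))
```

The second call illustrates the closing remark of the paragraph: the sequences need not be equally probable, only individually improbable enough.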

It would, however, be disingenuous not to mention that some who 
have worked on a closely related concept of probability, notably Keynes 
[K4] and Koopman [K9], [K10], [K11], would object to P6’ precisely 
because it implies that the agreement between numerical probability 
and qualitative probability is strict. Koopman, for example, holds 
that, if $A \supset B$ and $A \ne B$, then A is necessarily more probable than 
B, though the numerical probability of A may well be the same as that 
of B. Thus, if a marksman shoots at a wall, it is logically contradictory 
that his bullet should fall nowhere at all, but it is logically consistent 
that a prescribed mathematically ideal point on the bullet should strike 
a prescribed mathematically ideal line on the wall. Since the event of 
the prescribed point hitting a prescribed line is logically possible, Koop- 
man would insist that the event is more probable than the vacuous 
event, namely that the bullet goes nowhere, though the numerical proba- 
bility of both events is zero. I do not take direct issue with Koopman, 
because he is presumably talking about a somewhat different concept 
of probability from the particular relation $\le$; but I do not think it 
appropriate to suppose that the person would distinctly rather stake a 
gain on the line than on the null set. The issue is not really either an 
empirical or a normative one, because the point and line in question 
are mathematical idealizations. If the point and line are replaced by a 
dot and a band, respectively, then, of course, no matter how small the 
dot and band may be, the probability of the one hitting the other is 
greater than that of the vacuous event. But it seems to me entirely 
a matter of taste, conditioned by mathematical experience, to decide 
what idealization to make if the dot and band are replaced by their ideal- 
ized limits. So much for hair splitting. 

As far as the theory of probability per se is concerned, postulate P6’ 
is all that need be assumed, but in Chapter 5 a slightly stronger assump- 
tion will be needed that bears on acts generally, not only on those very 
special acts by which probability is defined. Therefore, I am about to 
propose a postulate, P6, that obviously implies P6’ and will therefore 
supersede it. This stronger postulate seems to me acceptable for the 
same reason that P6’ itself does. 


P6 If $g < h$, and f is any consequence; then there exists a partition
of S such that, if g or h is so modified on any one element of the
partition as to take the value f at every s there, other values being
undisturbed; then the modified g remains less than h, or g remains less
than the modified h, as the case may require.


4 Some mathematical details 


Are there qualitative probabilities that are both fine and tight, that 
are fine but not tight, that are tight but not fine, that are neither fine 
nor tight but do have one and only one almost agreeing probability 
measure? Examples answering all these questions in the affirmative 
will be exhibited in this section. 

To indicate a different topic that will also be treated here, those of 
you who have had more than elementary experience with mathematical 
treatments of probability know that it is not usual to suppose, as has 
been done here, that all sets have a numerical probability, but rather 
that a sufficiently rich class of sets do so, the remainder being consid- 
ered unmeasurable. Again, it is usual to suppose that, if each of an 
infinite sequence of disjoint sets is measurable, the probability of their 
union is the sum of their probabilities, that is, probability measures 
are generally assumed to be countably additive. But the theory being 
developed here does assume that probability is defined for all events, 
that is, for all sets of states, and it does not imply countable additivity, 
but only finite additivity. The present section not only answers the 
questions raised in the preceding paragraph, but also discusses the re- 
lation of the notions of limited domain of definition and of countable 
additivity to the theory of probability developed here. The general 
conclusions of this discussion are: First, there is no technical obstacle 
to working with a limited domain of definition, and, except for exposi- 
tory complications, it might have been mildly preferable to have done 
so throughout. Second, it is a little better not to assume countable 
additivity as a postulate, but rather as a special hypothesis in certain 
contexts. A different and much more extensive treatment of these 
questions has been given by de Finetti [D4]. 

Finally, before entering upon the main technical work of this sec- 
tion, one easy question about the relation between qualitative and 
quantitative probability will be answered and several as yet unanswered 
ones will be raised. 

Are there qualitative probabilities without any strictly agreeing meas- 
ure? Yes, because any qualitative probability that is fine but not 
tight is easily shown to provide an example. It is, however, an open 
question, stressed by de Finetti [D5], whether a qualitative probability 
on a finite S always has a strictly agreeing measure. It would also be 
technically interesting to know about the existence of almost agreeing 
measures in the same context.+


+ Even this has since been answered in the negative by Kraft, Pratt, and
Seidenberg (1959). See also (Fishburn 1970, pp. 210-211).

The matters to be treated in the rest of this section are rather
technical mathematically, and, though I would not delete them altogether,
it does not seem justifiable to lay the necessary groundwork for
presenting them in an elementary fashion. Some may, therefore, find it
necessary to skip the rest of this section altogether, or to skim it
rather lightly.

It is well known that there does not exist a countably additive
probability measure defined for every subset of the unit interval,
agreeing with Lebesgue measure on those sets where Lebesgue measure is
defined, and assigning the same measure to each pair of congruent sets+
(Problem (b), p. 276 of [H2]). On the other hand, there do exist finitely
additive probability measures agreeing with Lebesgue measure on those
sets for which Lebesgue measure is defined, and assigning the same
measure to each of any pair of congruent sets; cf. p. 32 of [B4]. The
existence of such measures shows, among other things, that a finitely
additive measure need not be countably additive. Again, calling such
a finitely additive extension of Lebesgue measure P and defining
$B \mathrel{\le\cdot} C$ to mean $P(B) \le P(C)$, we see an example of a
qualitative probability that is both fine and tight.

An example of a qualitative probability that is tight but not fine may
be constructed by taking for S two unit intervals, $S_1$ and $S_2$, in
each of which finitely additive extensions of Lebesgue measure, $P_1$
and $P_2$, are defined. The generic set B in this example is therefore
partitioned into $B_1 = B \cap S_1$ and $B_2 = B \cap S_2$,
respectively. For this example, let $B \mathrel{\le\cdot} C$; if, and
only if $P_1(B_1) < P_1(C_1)$, or else $P_1(B_1) = P_1(C_1)$, and
$P_2(B_2) \le P_2(C_2)$. This $\mathrel{\le\cdot}$ is not fine, because,
for example, S cannot be partitioned into events none of which is more
probable than $S_2$. On the other hand, it is easily seen to be tight.

Next, take S to be the union of $S_1$ and $S_2$ with the measures $P_1$
and $P_2$ as defined in the preceding example, but modify the definition
of $\mathrel{\le\cdot}$, saying $B \mathrel{\le\cdot} C$; if and only if
$P_1(B_1) + P_2(B_2) < P_1(C_1) + P_2(C_2)$, or else
$P_1(B_1) + P_2(B_2) = P_1(C_1) + P_2(C_2)$, and
$P_1(B_1) \le P_1(C_1)$. This is an example of a qualitative probability
that is fine but not tight.
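
The two orderings just described can be written down directly, as in the following Python sketch. As a simplifying assumption, each event B is represented only by the pair of numbers (P1(B1), P2(B2)); the function names and the sample events are hypothetical. The first comparison ranks by the first interval and uses the second only to break ties; the second comparison ranks by the total and uses the first interval to break ties.

```python
from fractions import Fraction


def leq_tight_not_fine(b, c):
    """b, c are pairs (P1(B1), P2(B2)): compare P1 first, then P2."""
    return b[0] < c[0] or (b[0] == c[0] and b[1] <= c[1])


def leq_fine_not_tight(b, c):
    """Compare the totals P1 + P2 first; break ties by P1."""
    total_b, total_c = b[0] + b[1], c[0] + c[1]
    return total_b < total_c or (total_b == total_c and b[0] <= c[0])


whole_S2 = (Fraction(0), Fraction(1))            # all of the second interval
sliver_of_S1 = (Fraction(1, 1000), Fraction(0))  # a small piece of the first
half_S1 = (Fraction(1, 2), Fraction(0))
half_S2 = (Fraction(0), Fraction(1, 2))

# Under the first ordering any piece of S1 with positive P1 outranks all of S2.
print(leq_tight_not_fine(whole_S2, sliver_of_S1), leq_tight_not_fine(sliver_of_S1, whole_S2))
# Under the second, two halves with equal totals are still strictly ranked.
print(leq_fine_not_tight(half_S2, half_S1), leq_fine_not_tight(half_S1, half_S2))
```

The first pair of results reflects why no partition of S can consist only of events no more probable than the second interval; the second shows a pair of events that a measure based on the totals cannot tell apart although the ordering ranks them strictly.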

Combining the ideas of the two preceding examples, it is easy to ex- 
hibit a qualitative probability that is neither fine nor tight but is such 
that S can be divided into arbitrarily many equally probable events. 
Thus all the questions raised in the opening paragraph of this section 
are answered in the affirmative. 


+ S. Ulam (1930) proves that any nonatomic, countably additive probability
measure defined on all the subsets of the unit interval is inconsistent with the
continuum hypothesis.




To get a feeling for the question whether literally all sets should be 
regarded as measurable, suppose that S is a cube of unit volume and 
that the probability measure P that strictly agrees with $\le$ is such that 
the probability of a parallelepiped is equal to its volume. It follows 
that the probability of any set having Jordan content is its Jordan 
content, but, if a set has not Jordan content, a continuum of possibili- 
ties is still open. Though other possibilities are conceivable, it is not 
unnatural to consider an idealized person for whom the numerical prob- 
ability attached to each Borel set, or even each Lebesgue measurable 
set, is its Lebesgue measure. To go further and take seriously compari- 
sons between sets that are not Lebesgue measurable, or even between 
those that are not Borel measurable, seems to me to be without any 
implication bearing on reality. I suppose it might be argued, on the 
contrary, that there is no feature of reality that can properly be inter- 
preted by postulating that the person is able to compare only sets from 
a sufficiently narrow field, so that it is simpler and more elegant to ad- 
mit all sets. The question seems to be one of taste, but the following 
remark illustrates what I consider an awkwardness in supposing proba- 
bility to be attached to all sets. It would seem, at first glance, that the 
person should be able, if he is so constituted, to regard all pairs of geo- 
metrically congruent sets for which he makes any comparison at all as 
equivalent, but the famous Banach-Tarski paradox [B5] shows that 
this cannot be done if all sets are regarded as measurable. I think it a 
little more graceful to abstain from comparison between the more bi- 
zarre sets than to give up, or even much modify, my everyday notions 
about the symmetry of such probability problems associated with 
geometry. 

If one is unwilling to insist on comparison between every pair of 
sets, or events; then, in the same spirit, it is inappropriate to insist on 
comparison between every pair of acts. All that has been, or is to be, 
formally deduced in this book concerning preferences among sets, could 
be modified, mutatis mutandis, so that the class of events would not 
be the class of all subsets of S, but rather a Borel field, that is, a
σ-algebra, on S; the set of all consequences would be a measurable space,
that is, a set with a particular σ-algebra singled out; and an act would
be a measurable function from the measurable space of events to the
measurable space of consequences. Indeed, the whole thing could be
done for abstract σ-algebras without reference to sets at all, and this
might have some actual advantage, since it would make possible the
identification of events with propositions in almost any formal language,
even one unable to formulate at all the complete descriptions I call
states.

It may seem peculiar to insist on σ-algebras as opposed to finitely
additive algebras even in a context where finitely additive measures are
the central object, but countable unions do seem to be essential to some
of the theorems of § 3, for example, the terminal conclusions of
Theorem 3.2 and Part 5 of Theorem 3.3.

So much of the modern mathematical theory of probability depends 
on the assumption that the probability measures at hand are countably 
additive that one is strongly tempted to assume countable additivity, 
or its logical equivalent, as a postulate to be adjoined to P1-6.+ But I 
am inclined to agree with de Finetti [D2], [D4] and Koopman [K9], 
[K10], [K11] that, however convenient countable additivity may be, 
it, like any other assumption, ought not be listed among the postulates 
for a concept of personal probability unless we actually feel that its 
violation deserves to be called inconsistent or unreasonable. I know of 
no argument leading to the requirement of countable additivity, and 
many of us have a strong intuitive tendency to regard as natural proba- 
bility problems about the necessarily only finitely additive uniform 
probability densities on the integers, on the line, and on the plane. It 
therefore seems better not to assume countable additivity outright as a 
postulate, but to recognize it as a special hypothesis yielding, where 
applicable, a large class of useful theorems. 


5 Conditional probability, qualitative and quantitative 


Conditional preferences among acts in the light of a given event were
introduced in § 2.7. Since the relation $\le$ among events has been
defined in terms of the corresponding relation among acts, we may well
expect to attach meaning to statements of the form $B \le C$ given D,
provided that D is not null. The natural way to do so is to take a pair
of acts f and g that test whether $B \le C$ (as prescribed by the
definition of $\le$ between acts in § 2) and say that $B \le C$ given D,
if and only if $f \le g$ given D. Since there is more than one pair of
acts f, g by which the proposition $B \le C$ can be tested, it is at
first sight conceivable that not all such pairs would be in the same
order given D, which would frustrate the proposed definition of $\le$
given D. However, it is easily seen that for any f, g testing
$B \le C$, $f \le g$ given D (D not null) is equivalent to
$B \cap D \le C \cap D$. Thus it is seen not only that the proposed
definition is unambiguous, but also that it is expressible in terms of
probability comparisons among sets, without direct reference to acts
at all, and, still further, that the postulates P1-6 apply to the
conditional preference relation $\le$ given D among acts. This preamble
sufficiently motivates the following definition and easy theorem about
qualitative probability relations generally.


+ Carried out by Villegas (1964).

If $\mathrel{\le\cdot}$ is a qualitative probability, and
$0 \mathrel{<\cdot} D$; then $B \mathrel{\le\cdot} C$ given D, if and
only if $B \cap D \mathrel{\le\cdot} C \cap D$.


THEOREM 1 If $\mathrel{\le\cdot}$ is a qualitative probability, then so
is $\mathrel{\le\cdot}$ given D. If in addition $\mathrel{\le\cdot}$ is
fine or tight, then $\mathrel{\le\cdot}$ given D is correspondingly
fine or tight.


If $\mathrel{\le\cdot}$ is fine, then, for any D that is not null, there
exists, in view of Theorem 3.3, one and only one probability measure
$P(B \mid D)$, the (conditional) probability of B given D, that almost
agrees with $\mathrel{\le\cdot}$ given D. But, just as one would expect
from the traditional study of numerical probability, and as may be
easily verified, $P(B \cap D)/P(D)$ considered as a function of B for
fixed D is a probability measure that almost agrees with
$\mathrel{\le\cdot}$ given D. Therefore,

$$P(B \mid D) = P(B \cap D)/P(D). \tag{1}$$


As was explained in § 2.7, preference among acts given B can sug- 
gestively be expressed in temporal terms. Analogously, the comparison 
among events given B and, therefore, conditional probability given B 
can be expressed temporally. Thus P(C | B) can be regarded as the 
probability the person would assign to C after he had observed that B 
obtains. It is conditional probability that gives expression in the theory 
of personal probability to the phenomenon of learning by experience. 

In accordance with established usage, a pair of events B, C are called 
independent if $P(B \cap C) = P(B)P(C)$. More generally, a set of events 
are called independent, if for every finite set of them, say 
$B_1, \dots, B_n$, 


(2) $P\left(\bigcap_i B_i\right) = \prod_i P(B_i).$ 


Obviously, if D is not null, B and D are independent, if and only if 
P(B | D) = P(B), in which case D may fairly be called irrelevant to B. 
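
As a small numerical illustration, not part of the original text, the equivalence of independence and irrelevance can be checked on a finite space. The sketch below assumes a fair die as the probability space; the events B, C, and D are hypothetical choices made only for the example.

```python
from fractions import Fraction

# Hypothetical probability space: one roll of a fair die.
S = frozenset(range(1, 7))

def P(event):
    """Probability of an event (a subset of S) under the uniform measure."""
    return Fraction(len(event & S), len(S))

def P_given(B, D):
    """Conditional probability P(B | D) = P(B ∩ D) / P(D), for non-null D."""
    if P(D) == 0:
        raise ValueError("D must not be null")
    return P(B & D) / P(D)

B = frozenset({2, 4, 6})      # the roll is even
D = frozenset({1, 2, 3, 4})   # the roll is at most 4
C = frozenset({4, 5, 6})      # the roll is at least 4

# B and D are independent, so D is irrelevant to B.
indep_BD = P(B & D) == P(B) * P(D)
irrelevant_D_to_B = P_given(B, D) == P(B)

# B and C are not independent, and C is relevant to B.
indep_BC = P(B & C) == P(B) * P(C)
irrelevant_C_to_B = P_given(B, C) == P(B)
```

Exact rational arithmetic is used so that the equalities are tested without rounding error.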

The notions of independence and irrelevance have, so far as I can 
see, no analogues in qualitative probability; this is surprising and 
unfortunate, for these notions seem to evoke a strong intuitive response. 
The absence of these analogues is traceable to the absence of a 
qualitative analogue for propositions of the form $P(B \mid C) \le P(G \mid H)$. 
Working under a rather different motivation from that which guides this 
book, B. O. Koopman [K9], [K10], and [K11] has developed a system of 
qualitative probability in which it is meaningful to compare B given C 
with G given H. It is true also that for qualitative probability, even as 
it is defined here, some interconditional comparisons might be 
naturally defined. If, for example, $B \le\cdot {\sim}B$ given C and 
${\sim}G \le\cdot G$ given H, it would not be unreasonable to establish the 
convention that B given C $\le\cdot$ G given H. This sort of extension is 
not, however, highly pertinent to my purpose, for here I have little 
interest in qualitative probabilities, except as a foundation for 
quantitative probability. 

The following partition formula is well known and easy to prove: 


(3) $P(C) = \sum_i P(C \mid B_i)\,P(B_i),$ 


where $B_i$ is a partition of S into non-null sets. If, further, C is not 
null, it is also trivial to derive the celebrated Bayes’ rule (or theorem), 


(4) $P(B_i \mid C) = \dfrac{P(C \mid B_i)\,P(B_i)}{P(C)} = \dfrac{P(C \mid B_i)\,P(B_i)}{\sum_j P(C \mid B_j)\,P(B_j)}.$ 


Illustrations of these formulas are found in all elementary textbooks on 
probability, as well as in later sections of this book. 

Finally, if neither B nor C is null, 


(5) $\dfrac{P(B \mid C)}{P(B)} = \dfrac{P(C \mid B)}{P(C)} = \dfrac{P(B \cap C)}{P(B)P(C)},$ 


which may be given the suggestive reading: Knowledge of C modifies 
the probability of B by the same factor by which knowledge of B 
modifies the probability of C. 
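
The partition formula, Bayes' rule, and the symmetric relation just stated can be checked numerically. The following is a minimal sketch, assuming a hypothetical partition into three urns and made-up probabilities; none of the numbers or function names come from the text.

```python
from fractions import Fraction

# Hypothetical example: three urns B_1, B_2, B_3 form the partition, and
# C is the event "a white ball is drawn". All numbers are made up.
prior = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}       # P(B_i)
likelihood = {1: Fraction(1, 4), 2: Fraction(1, 2), 3: Fraction(3, 4)}  # P(C | B_i)

def total_probability(prior, likelihood):
    """Partition formula (3): P(C) = sum_i P(C | B_i) P(B_i)."""
    return sum(likelihood[i] * prior[i] for i in prior)

def bayes(prior, likelihood):
    """Bayes' rule (4): P(B_i | C) for every i, assuming C is not null."""
    p_c = total_probability(prior, likelihood)
    return {i: likelihood[i] * prior[i] / p_c for i in prior}

p_c = total_probability(prior, likelihood)
posterior = bayes(prior, likelihood)

# Equation (5): P(B_i | C) / P(B_i) equals P(C | B_i) / P(C).
factors_agree = all(
    posterior[i] / prior[i] == likelihood[i] / p_c for i in prior
)
```

With these numbers P(C) is 5/12 and the posterior probabilities are 3/10, 2/5, and 3/10.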

The concept of random variable enters into almost any discussion of 
probability. Experts are fairly well agreed on the following definition. 
A random variable is a function x attaching a value x(s) in some set 
X to every s in a set S on which a probability measure P is defined.† 
Such an S together with the measure P is called a probability space. 

Real-valued random variables are the most familiar, though in gen- 
eral the values X can be things of any sort. If, for example, x and y, 
with values in X and Y, respectively, are random variables on the 
same measure space, a new random variable z = {x, y} is defined by 
setting z(s) = {x(s), y(s)}. The values of z are thus elements of what 
is called X × Y (read the cartesian product of X and Y), the set of 
ordered pairs with first element in X and second in Y. The same sort 
of thing can be done, of course, for ordered n-tuples and also for infinite 
sequences of random variables. 


† In many applications of the theory of probability, not all subsets of S or of X 
are considered measurable. It is then required as part of the definition of random 
variable that x be measurable, i.e., that for every measurable $Y \subset X$, the set of 
s’s such that $x(s) \in Y$ be measurable. 

Two random variables x and y defined on the same measure space S 
are called (statistically) independent, if and only if, for every $X_0 \subset X$ 
and $Y_0 \subset Y$, the two events (i.e., subsets of S) defined by the 
conditions $x(s) \in X_0$ and $y(s) \in Y_0$, respectively, are independent.† The 
extension of this definition from pairs to any number of random variables 
is obvious. 
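
To make the definitions of random variable and of independence concrete, here is a minimal sketch on a finite probability space. The space (two tosses of a fair coin) and the variables x, y, z, w are assumptions introduced for illustration only; the check simply enumerates every pair of subsets of the value sets.

```python
from fractions import Fraction
from itertools import product, combinations

# Hypothetical probability space: two tosses of a fair coin.
S = list(product("HT", repeat=2))
P = {s: Fraction(1, 4) for s in S}

def x(s):   # 1 if the first toss is heads, else 0
    return int(s[0] == "H")

def y(s):   # 1 if the second toss is heads, else 0
    return int(s[1] == "H")

def z(s):   # the pair {x, y}, with values in the cartesian product X × Y
    return (x(s), y(s))

def w(s):   # number of heads; depends on both tosses
    return x(s) + y(s)

def prob(pred):
    """Probability of the event {s : pred(s)}."""
    return sum(P[s] for s in S if pred(s))

def subsets(values):
    """All subsets of a finite set of values."""
    vals = sorted(values)
    return [set(c) for k in range(len(vals) + 1) for c in combinations(vals, k)]

def independent(u, v):
    """For all X0, Y0: the events u(s) in X0 and v(s) in Y0 are independent."""
    U = subsets({u(s) for s in S})
    V = subsets({v(s) for s in S})
    return all(
        prob(lambda s: u(s) in X0 and v(s) in Y0)
        == prob(lambda s: u(s) in X0) * prob(lambda s: v(s) in Y0)
        for X0 in U for Y0 in V
    )

x_y_independent = independent(x, y)   # True
x_w_independent = independent(x, w)   # False
```

Brute-force enumeration of subsets is only feasible for very small value sets; it is used here because it mirrors the definition directly.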


6 The approach to certainty through experience 


In § 3, the theory of personal probability was, from the purely math- 
ematical point of view, reduced to that of probability measures, a sub- 
ject that has been elaborately studied, more or less explicitly, for cen- 
turies. Any mathematical problem concerning personal probability is 
necessarily a problem concerning probability measures—the study of 
which is currently called by mathematicians mathematical probability 
—and conversely. The particular outlook and interpretation implicit 
in a personalistic concept of probability leads, however, to problems 
that, though perfectly meaningful for mathematical probability, might 
not otherwise have been emphasized. This section and the succeeding 
one each briefly discuss one such problem. These two problems are 
selected from among many possibilities for the insight they provide 
into the concept of personal probability. 

Before studying these problems, it is necessary to be conversant with 
the material in Appendixes 1 and 2, which is used in the immediate 
sequel and often throughout the rest of this book. 

As was brought out in §5, the person learns by experience. The 
purpose of the present section is to explore with a moderate degree of 
generality how he typically becomes almost certain of the truth, when 
the amount of his experience increases indefinitely. To be specific, 
suppose that the person is about to observe a large number of random 
variables, all of which are independent given $B_i$ for each i, where the 
$B_i$ are a partition of S. It is to be expected intuitively, and will soon 
be shown, that under general conditions the person is very sure that 
after making the observation he will attach a probability of nearly 1 to 
whichever element of the partition actually obtains. 

† Where not all sets are measurable, $X_0$ and $Y_0$ must, of course, be required to 
be measurable. 

To describe the situation formally, let $B_i$ be a partition of S with 
$P(B_i) = \beta(i)$. Let $x_r$, r = 1, 2, ..., be a sequence of random variables, 
each taking on only a finite number of values (which can without loss 
of generality be thought of as integers). The restriction to a finite set 
of values could be removed, but to do so would raise problems of 
mathematical technique that, however interesting, are rather beside the 
point of this book. Let x denote the first n of the random variables $x_r$. 
It is to be borne in mind that x depends on n, so, strictly speaking, it 
should be written x(n). The assumption that, given $B_i$, the $x_r$’s all 
have the same distribution is expressed by 


(1) $P(x_r(s) = x_r \mid B_i) = \xi(x_r \mid i),$ 


where $\xi(x_r \mid i)$ is defined by the context. Combining (1) with the 
assumption that the $x_r$’s are independent given $B_i$, 


(2) $P(x \mid B_i) =_{\mathrm{Df}} P(x(s) = \{x_1, \dots, x_n\} \mid B_i) = \prod_{r=1}^{n} \xi(x_r \mid i),$ 


where a conventional symbol has been used for equal by definition. 
These hypotheses having been laid down, it follows from Bayes’ rule 
and the partition formula, (5.4) and (5.3), that 


(3) $P(B_i \mid x) = \dfrac{P(x \mid B_i)\,P(B_i)}{P(x)}$ 


and 


(4) $P(x) = \sum_i \beta(i) \prod_{r=1}^{n} \xi(x_r \mid i).$ 


In connection with (3), it may be observed in passing that, if the a priori 
probability, $\beta(i)$, of $B_i$ is 0, then, no matter what value x is observed, 
the a posteriori probability of $B_i$, $P(B_i \mid x)$, is also 0. This is an 
example of the general principle that, if some event is regarded as 
virtually impossible, then no evidence whatsoever can lend it credibility. 
Similarly, (3) implies the equally common-sense principle that, if an 
observation x is virtually impossible on the hypothesis (i.e., given) 
$B_i$, and x is observed, then $B_i$ becomes virtually impossible a posteriori. 

It is particularly interesting to compare the probability of two 
elements of the partition, say $B_1$ and $B_2$ for definiteness, in the light of x. 


(5) $\dfrac{P(B_1 \mid x)}{P(B_2 \mid x)} = \dfrac{\beta(1)}{\beta(2)} \prod_{r=1}^{n} \dfrac{\xi(x_r \mid 1)}{\xi(x_r \mid 2)} = \dfrac{\beta(1)}{\beta(2)} \prod_{r=1}^{n} R'(x_r) = \dfrac{\beta(1)}{\beta(2)}\,R(x),$ 


where self-explanatory abbreviations have been introduced. Equation 
(5) is meaningless, if both the numerator and denominator of its 
left-hand side vanish. If the denominator alone vanishes, the fraction may 
properly be regarded as infinite. This will happen, if and only if $B_2$ is 
null, and $B_1$ is not null given x. That is, it will happen if and only if 
$\beta(1) \ne 0$, $\beta(2) = 0$, or if $\beta(1) \ne 0$, and $R(x) = \infty$. 

In modern statistical usage, $R'(x_r)$ and $R(x)$ are the likelihood ratios 
of $B_1$ to $B_2$ given $x_r$ and x, respectively, quantities of importance in 
many theoretical contexts. 
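
Equations (2) through (5) can be traced in a short computation. The sketch below assumes two hypothetical hypotheses about a coin and an arbitrarily chosen observed sequence; the dictionaries beta and xi and all function names are illustrative, not taken from the text.

```python
from fractions import Fraction
from math import prod

# Hypothetical setup: two hypotheses about a coin, B_1 (heads with
# probability 3/4) and B_2 (heads with probability 1/4); 1 = heads, 0 = tails.
beta = {1: Fraction(1, 2), 2: Fraction(1, 2)}              # beta(i) = P(B_i)
xi = {
    1: {1: Fraction(3, 4), 0: Fraction(1, 4)},             # xi(x_r | 1)
    2: {1: Fraction(1, 4), 0: Fraction(3, 4)},             # xi(x_r | 2)
}

def p_x_given(i, xs):
    """Equation (2): P(x | B_i) as a product of the xi(x_r | i)."""
    return prod((xi[i][v] for v in xs), start=Fraction(1))

def p_x(xs):
    """Equation (4): P(x) = sum_i beta(i) prod_r xi(x_r | i)."""
    return sum(beta[i] * p_x_given(i, xs) for i in beta)

def posterior(xs):
    """Equation (3): P(B_i | x) for each i, assuming P(x) > 0."""
    total = p_x(xs)
    return {i: beta[i] * p_x_given(i, xs) / total for i in beta}

def likelihood_ratio(xs):
    """R(x): the product of the single-observation ratios R'(x_r)."""
    return prod((xi[1][v] / xi[2][v] for v in xs), start=Fraction(1))

xs = [1, 1, 0, 1]                  # an observed sequence, chosen arbitrarily
post = posterior(xs)
R = likelihood_ratio(xs)

# Equation (5): posterior odds = prior odds times R(x).
odds_check = post[1] / post[2] == beta[1] / beta[2] * R
```

For this sequence R(x) is 9, so with equal prior probabilities the posterior probability of B_1 is 9/10.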

If a person contemplates making the observation x, that is, finding 
out the value of x(s) for the s that is the true state of the world, it may 
properly be asked how probable he considers it that R will turn out to 
have a particular value. It will be shown, barring two banal 
exceptions, that, for n sufficiently large, the probability, given $B_1$, that R is 
greater than any preassigned number is almost 1. The possibility 
$P(B_1) = 0$ is to be excepted, for then the conditional probability in 
question is meaningless. The other exception occurs when 
$\xi(x_r \mid 1) = \xi(x_r \mid 2)$ for every $x_r$, that is, when the common 
distribution of $x_r$ given $B_1$ is the same as it is given $B_2$; for then 
observation of $x_r$ is simply irrelevant in distinguishing $B_1$ from $B_2$, or, a 
little more technically, $x_r$ is irrelevant to $B_1$ given $B_1 \cup B_2$, and 


(6) $P(R'(x_r) = 1 \mid B_1) = 1.$ 


Formally, it is to be demonstrated that, unless $P(B_1) = 0$, or (6) 
holds, 


(7) $\lim_{n \to \infty} P(R(x) > \rho \mid B_1) = 1$ for $0 < \rho < \infty$. 

The problem is quite simple when account is taken of the fact that 
$R(x)$ is the product of n random variables, $R'(x_r)$, that are independent 
given $B_1$. In attacking the problem, two cases are to be distinguished, 
according as there are or are not values of x that have positive 
probability given $B_1$ but zero probability given $B_2$. 

It is in practice rather fortunate to find instances of the first case, 
for then (7) applies with a vengeance. Indeed, suppose that 


(8) $P(R'(x_r) < \infty \mid B_1) = \phi, \quad \phi < 1.$ 

Then 

(9) $P(R = \infty \mid B_1) = 1 - \phi^n,$ 


which obviously approaches 1 with increasing n. 

The second case, namely $\phi = 1$, is more interesting. Since much is 
known about sums of identically distributed independent random 
variables, it is natural to investigate 


(10) $\log R(x) = \sum_{r=1}^{n} \log R'(x_r),$ 


thereby replacing a product by a sum. It is easily seen from the 
definition of $R'(x_r)$ that $P(R'(x_r) > 0 \mid B_1) = 1$, so, in the case now at 
hand, the functions $\log R'(x_r)$ are independent real bounded random 
variables. 


Letting 

(11) $I = E(\log R'(x_r) \mid B_1),$ 

the weak law of large numbers† implies that, for any $\epsilon > 0$, 

(12) $\lim_{n \to \infty} P(\log R(x) > n(I - \epsilon) \mid B_1) = 1,$ 

equivalently, 

(13) $\lim_{n \to \infty} P(R(x) > e^{n(I - \epsilon)} \mid B_1) = 1.$ 


The objective will therefore be achieved, if it is demonstrated that 
I > 0 unless (6) holds. But 


(14) $I = E(\log R'(x_r) \mid B_1) \ge -\log E(R'^{-1}(x_r) \mid B_1) = -\log 1 = 0,$ 


as may be argued thus: The inequality in the above calculation is 
assigned as Exercise 8 in Appendix 2, together with the fact that equality 
can hold in (14) if and only if $R'^{-1}(x_r)$ is constant with probability 
one given $B_1$. But the expected value of $R'^{-1}(x_r)$ given $B_1$ is equal to 
1, as (14) asserts and as may be easily verified from the definition of 
$R'^{-1}(x_r)$. So, barring the exceptions provided for, I > 0, and the 
demonstration of (7) is complete. 
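
The limit (7) can be watched numerically in a simple case. The sketch below assumes two hypothetical coin hypotheses, so that R(x) depends on the observations only through the number y of heads, and computes P(R(x) > ρ | B_1) exactly from the binomial distribution. The threshold ρ = 1000 and the sample sizes are arbitrary choices.

```python
from fractions import Fraction
from math import comb, log

# Hypothetical coin hypotheses: heads has probability 3/4 under B_1
# and 1/4 under B_2, so R'(heads) = 3 and R'(tails) = 1/3.
p1, p2 = Fraction(3, 4), Fraction(1, 4)

def prob_R_exceeds(rho, n):
    """Exact P(R(x) > rho | B_1) for n tosses, summing over the number y of heads."""
    total = Fraction(0)
    for y in range(n + 1):
        R = (p1 / p2) ** y * ((1 - p1) / (1 - p2)) ** (n - y)
        if R > rho:
            total += comb(n, y) * p1 ** y * (1 - p1) ** (n - y)
    return total

# The quantity I of equation (11); positive because the two distributions differ.
I = (float(p1) * log(float(p1 / p2))
     + float(1 - p1) * log(float((1 - p1) / (1 - p2))))

rho = 1000
probs = [float(prob_R_exceeds(rho, n)) for n in (10, 50, 200)]
```

In this example I equals half the natural logarithm of 3, and the computed probabilities increase toward 1 as n grows, as (7) asserts.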

† For the definition of this law, see, if necessary, p. 191 of Feller’s book [F1]. 

Before the observation, the probability that the probability given x 
of whichever element of the partition actually obtains will be greater 
than $\alpha$ is 


(15) $\sum_i \beta(i)\,P(P(B_i \mid x) > \alpha \mid B_i),$ 


where summation is confined to those i’s for which $\beta(i) \ne 0$. 
Application of (14) (extended to arbitrary pairs of i’s) shows that the 
coefficient of each $\beta(i)$ in the quantity (15), and therefore the quantity 
itself, approaches 1 as n increases; provided only that no two functions 
$\xi(x_r \mid i)$ and $\xi(x_r \mid i')$ are the same, if $\beta(i)$ and $\beta(i')$ are both 
different from zero. 

To summarize informally, it has now been shown that, with the ob- 
servation of an abundance of relevant data, the person is almost cer- 
tain to become highly convinced of the truth, and it has also been shown 
that he himself knows this to be the case. 
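
The quantity (15) can also be computed exactly in a small case. This is a sketch under assumed inputs: a hypothetical two-element partition with coin probabilities 3/4 and 1/4, equal prior probabilities, and the level α = 99/100. The names p, beta, and prob_convinced are illustrative.

```python
from fractions import Fraction
from math import comb

# Hypothetical partition: heads probability p[i] under B_i, prior beta[i].
p = {1: Fraction(3, 4), 2: Fraction(1, 4)}
beta = {1: Fraction(1, 2), 2: Fraction(1, 2)}

def posterior(i, y, n):
    """P(B_i | x) when x contains y heads in n tosses (Bayes' rule)."""
    weight = {j: beta[j] * p[j] ** y * (1 - p[j]) ** (n - y) for j in p}
    return weight[i] / sum(weight.values())

def prob_convinced(alpha, n):
    """Quantity (15): sum_i beta(i) P(P(B_i | x) > alpha | B_i)."""
    total = Fraction(0)
    for i in p:
        for y in range(n + 1):
            if posterior(i, y, n) > alpha:
                total += beta[i] * comb(n, y) * p[i] ** y * (1 - p[i]) ** (n - y)
    return total

alpha = Fraction(99, 100)
values = [float(prob_convinced(alpha, n)) for n in (5, 20, 80)]
```

The computed values rise toward 1 with n: before observing anything, the person already expects to end up nearly certain of whichever hypothesis is true.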

It may be remarked, for those familiar with certain theorems, that 
many refinements of (7) and its consequences could be worked out by 
application of the strong law of large numbers, the central limit 
theorem, and the law of the iterated logarithm to $R'(x_r)$. 

The quantity I is coming to be called the information of the 
distribution of $x_r$ given $B_1$ with respect to the distribution of $x_r$ given $B_2$. 
More generally, if P and Q are probability measures, confined (for 
simplicity) to a finite set X with elements x, the information of P with 
respect to Q is defined by 


(16) $\sum_x P(x) \log \dfrac{P(x)}{Q(x)}.$ 


This usage stems from work of Claude Shannon in communication en- 
gineering, a good account of which is given in [S11]; and also from inde- 
pendent work of Norbert Wiener in a related context [W10]. The ideas 
of Shannon and of Wiener, though concerned with probability, seem 
rather far from statistics. It is, therefore, something of an accident 
that the term “information” coined by them should be not altogether 
inappropriate in statistics. The situation is still further confused, be- 
cause, as long ago as 1925, R. A. Fisher emphasized an important no- 
tion, which he called “information,” in connection with the theory of 
estimation (Paper 11, Theory of statistical estimation in [F6]). At first 
glance, Fisher’s notion seems quite different from that of Shannon and 
Wiener, but, as a matter of fact, his is a limiting form of theirs. A 
useful but rather technical exposition relating the several senses of 
“information” is given by Kullback and Leibler [K15], and I return to the 
topic in § 15.6.+ 
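
The information (16) is easy to compute for distributions on a finite set. The function below is a minimal sketch; its handling of zero probabilities (terms with P(x) = 0 are dropped, and the value is infinite if Q(x) = 0 where P(x) > 0) is a common convention assumed here, not something stated in the text, and the two example distributions are hypothetical.

```python
from math import log

def information(P, Q):
    """Information of P with respect to Q, equation (16): sum_x P(x) log(P(x)/Q(x)).

    P and Q are dicts mapping the elements of a finite set X to probabilities.
    Assumed conventions: terms with P(x) = 0 contribute nothing, and the
    result is infinite if Q(x) = 0 for some x with P(x) > 0.
    """
    total = 0.0
    for x, px in P.items():
        if px == 0:
            continue
        qx = Q.get(x, 0)
        if qx == 0:
            return float("inf")
        total += px * log(px / qx)
    return total

# Hypothetical distributions of a single observation x_r given B_1 and given B_2.
xi_1 = {1: 0.75, 0: 0.25}
xi_2 = {1: 0.25, 0: 0.75}

I_12 = information(xi_1, xi_2)   # the I of equation (11) for this pair
I_11 = information(xi_1, xi_1)   # zero: identical distributions
```

For these two distributions the information is half the natural logarithm of 3, the same value as E(log R'(x_r) | B_1) computed directly from the likelihood ratios 3 and 1/3.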


7 Symmetric sequences of events 


A problem often posed by statisticians is to estimate from a sequence 
of observations the unknown probability p that repeated trials of some 
sort are successful. On an objectivistic view, this problem is natural 
and important, for on such a view the probability that a coin falls heads, 
for example, is a property of the coin that can be determined by 
experimentation with the coin and in no other way. But on a personalistic 
view of probability, strictly interpreted, no probability is unknown to 
the person concerned, or, at any rate, he can determine a probability 
only by interrogating himself, not by reference to the external world. 

+ See also (Kullback 1961). 

This situation has been interpreted to imply that the personalistic 
view is wrong, or at any rate inadequate, because it apparently cannot 
even express one of the most natural and typical problems of statistics. 
Thus far in this book, I have not argued against the possibility of de- 
fining some useful notion of objective probability, but have contented 
myself with presenting a particular notion of personal probability. 
Therefore, at this point it might be tempting to seek a dualistic theory 
admitting both objective and personal probabilities in some kind of ar- 
ticulation with one another. De Finetti [D3] has shown, however, 
that it is not necessary to do so, that the notion of a coin with unknown 
probability p can be reinterpreted in terms of personal probability 
alone. 

The present section is devoted to outlining this development due to 
de Finetti. In the organization of the book as a whole, it plays no logi- 
cally essential part; it is, rather, a digression intended to give a clearer 
understanding of the notion of personal probability, especially in rela- 
tion to objectivistic views. The ideas presented here are but a frag- 
ment of those on the same subject in [D2]. 

Let $x_r$ be a sequence of random variables taking only the values 0 
and 1. The $x_r$’s are, to all intents and purposes, a sequence of events, 
the rth of which is the event that $x_r(s) = 1$. To say that these events 
are independent, each occurring with probability p, is to say that the 
probability of any finite pattern, $x_1, \dots, x_n$, initiating the sequence 
$x_r(s)$ is given by the formula 

(1) $P(x_r(s) = x_r;\ r = 1, \dots, n) = p^y (1 - p)^{n-y},$ 

where y is the number of 1’s among the $x_r$’s for r = 1, ..., n. 


Mixtures, in a certain sense, of sequences of random variables are 
often of interest, as they already have been in the preceding section. 
Suppose, to be explicit, that the world is partitioned by $B_i$ and that, 
given $B_i$, the $x_r$’s are independent with $P(x_r(s) = 1 \mid B_i)$ having some 
fixed value p(i). Then the unconditional probability of a particular 
initial sequence is a mixture of the probabilities given by (1) thus: 


(2) $P(x_r(s) = x_r;\ r = 1, \dots, n) = \sum_i p(i)^y (1 - p(i))^{n-y}\,P(B_i).$ 


It is natural to generalize (2) formally thus: 


(3) $P(x_r(s) = x_r;\ r = 1, \dots, n) = \int p^y (1 - p)^{n-y}\,dM(p),$ 

where M is a probability measure on the real numbers in the interval 
[0, 1]. 

It is noteworthy that equation (3), understood to apply for every n, 
is equivalent to the condition that the probability that every one of each 
prescribed set of n of the $x_r$’s takes the value 1 is 


(4) $\int p^n\,dM(p).$ 


This follows by arithmetic induction from the obvious formula 

(5) $P(x_r(s) = x_r;\ r = 1, \dots, n) = P(x_r(s) = x_r;\ r = 1, \dots, n;\ x_{n+1}(s) = 0) + P(x_r(s) = x_r;\ r = 1, \dots, n;\ x_{n+1}(s) = 1),$ 


which applies to any sequence of random variables taking on only the 
values 0 and 1. 
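
For a measure M concentrated on finitely many values of p, the mixture formula reduces to a finite sum that can be evaluated exactly. The sketch below assumes a hypothetical three-component mixture; it checks that the probability of a pattern depends only on its length and its number of 1's, and verifies the additivity formula (5) for one pattern.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite mixture: pairs (P(B_i), p(i)).
mixture = [
    (Fraction(1, 2), Fraction(1, 4)),
    (Fraction(1, 3), Fraction(1, 2)),
    (Fraction(1, 6), Fraction(9, 10)),
]

def pattern_probability(xs):
    """Equation (2): probability that the sequence begins with the 0/1 pattern xs."""
    n, y = len(xs), sum(xs)
    return sum(w * p ** y * (1 - p) ** (n - y) for w, p in mixture)

# The probability of a pattern depends only on n and y, not on the order.
n = 4
by_count = {}
for xs in product((0, 1), repeat=n):
    by_count.setdefault(sum(xs), set()).add(pattern_probability(xs))
same_within_count = all(len(vals) == 1 for vals in by_count.values())

# Formula (5): a pattern's probability is the sum over its two one-step extensions.
xs = (1, 0, 1)
extension_ok = pattern_probability(xs) == (
    pattern_probability(xs + (0,)) + pattern_probability(xs + (1,))
)

# The probabilities of all 2**n patterns of length n add up to 1.
total = sum(pattern_probability(xs) for xs in product((0, 1), repeat=n))
```

The order-independence check is immediate from the form of (2); it is included only to make the property visible.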

Equation (3) can very well have an interpretation in such terms that 
the measure M is not merely an abstract probability measure, but is 
actually a personal probability. Thus, if p is a random variable that 
is (for a given person) distributed according to M, and, if for each p 
the conditional distribution of the x,’s given p is independent, with 
$P(x_r(s) = 1) = p$; then (3) obtains. Strictly speaking, the notion of 
conditional probability as it occurs in the preceding sentence is used in 
a somewhat wider sense than has been defined in this book, for the 
probability of any particular p will typically be zero. At least for 
countably additive measures, the necessary extension of conditional 
probability and conditional expectation is presented by Kolmogoroff in 
[K7]; it is a concept of the greatest value in advanced mathematical 
statistics and in probability generally. 

However, in most contexts where objectivists speak of an unknown 
probability p, there is, so far as an exclusively personalistic view of 
probability is concerned, no unknown parameter that can play the role 
of p in (3). 

Examination of situations in which “unknown” probability is 
appealed to, whether justifiably or not, shows that, from the personalistic 
standpoint, they always refer to symmetric sequences of events in the 
sense of the following definition. The sequence of random variables 
$x_r$, taking only the values 0 and 1, is a symmetric† sequence, if and only 
if the probability that any b of the $x_r(s)$’s equal 1 and any c other 
$x_r(s)$’s equal 0 depends only on the integers b and c. 

† De Finetti uses the French word for “equivalent.”+ 


+ He and others now prefer “exchangeable.” The concept seems to have been 
first suggested by Jules Haag (1928). 

It is easy to verify that any mixture of independent sequences in the 
sense of (3) is a symmetric sequence. De Finetti has discovered that 
the converse is also true. These conclusions can be formally summarized 
thus: 


THEOREM 1 A sequence of random variables $x_r$, taking only the 
values 0 and 1, is symmetric, if and only if there exists a probability 
measure M on the interval [0, 1] such that the probability that any 
prescribed n of the $x_r(s)$’s equal 1 is given by (4). Two such measures, M 
and M’, must be essentially the same,† in the sense that, if B is a 
subinterval of [0, 1], then M(B) = M’(B). 


Considering that de Finetti has published a proof of Theorem 1 in 
[D2] based on the Fourier integral, that any proof of it must be rather 
technical, and that the theorem is not the basis of any formal inference 
later in this book, it seems best not to prove it here.‡ 

It is Theorem 1 that makes it possible to express propositions re- 
ferring to unknown probabilities in purely personalistic terms. If, for 
example, a statistician were to say, “I do not know the p of this coin, 
but I am sure it is at most one half,” that would mean in personalistic 
terms, “I regard the sequence of tosses of this coin as a symmetric se- 
quence, the measure M of which assigns unit measure to the interval 
[0, 1/2].” This condition on M means in turn that for every n the 
(personal) probability of n consecutive heads is at most $2^{-n}$, as is easily 
verified. I do not insist that propositions couched in terms of a ficti- 
tious unknown probability are bad, if understood as suggestive abbrevi- 
ations, but only that the meaningfulness of such propositions does not 
constitute an inadequacy of the personalistic view of probability. 
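
The bound on the probability of n consecutive heads can be checked directly for a discrete measure. The sketch below assumes a hypothetical M concentrated on three points of the interval from 0 to 1/2; the points and weights are made up for illustration.

```python
from fractions import Fraction

# Hypothetical measure M concentrated on [0, 1/2]: maps each point p to its weight.
M = {Fraction(1, 10): Fraction(1, 5),
     Fraction(1, 3): Fraction(1, 2),
     Fraction(1, 2): Fraction(3, 10)}

def prob_n_heads(n):
    """Expression (4) for a discrete M: the probability of n consecutive heads."""
    return sum(w * p ** n for p, w in M.items())

# Each p is at most 1/2, so the mixture of the p**n is at most 2**(-n).
bounds_hold = all(prob_n_heads(n) <= Fraction(1, 2 ** n) for n in range(1, 31))
```

The check covers n from 1 to 30 only; the general statement follows because every p in the support satisfies p**n ≤ 2**(-n).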

† Technical note: If “probability measure” were here understood to mean a 
countably additive probability measure on the Borel sets of [0, 1], the theorem would 
remain true, and the essential uniqueness of M would become true uniqueness. 

‡ Technical note: Theorem 1 can be proved very quickly and naturally by 
applying the theory of the Hausdorff moment problem (pp. 8-9 of [S13]) to M, but this 
method does not seem to generalize readily.+ 

+ New and general methods are in Hewitt and Savage (1955) and 
Ryll-Nardzewski (1957). For related work see Bühlmann (1960), Freedman (1962, 
1963), Milicer-Gruzewska (1949, 1950), and Rényi and Révész (1963). 

The mathematical concept of probability measure or, a trifle more 
generally, bounded measure is fundamental to mathematics generally. 
Probability measures, often under other names, are, therefore, 
employed in many parts of pure and applied mathematics completely 
unrelated to probability proper. For example, the distribution of mass 
in a not necessarily rigid body is expressed by a bounded measure that 
tells how much of the body is in each region of space. We must, 
therefore, not be surprised if, even in studying probability itself, we come 
across some probability measures used not to measure probability 
proper but only for auxiliary purposes. In the event that p is not 
actually an unknown parameter, the measure M presented by Theorem 1 
seems at first sight to be such a purely auxiliary measure, but, as a matter 
of fact, M does measure certain interesting probabilities, at least 
approximately. For example, letting 


(6) $\bar{x}_n = \dfrac{1}{n} \sum_{r=1}^{n} x_r,$ 

it can be shown that 

(7) $\lim_{n \to \infty} P(\bar{x}_n(s) \le \theta) = M(p \le \theta).$ 


In words, the person considers the average of any large number of fu- 
ture observations to be distributed approximately the way p is dis- 
tributed by M. This is an extension of the ordinary weak law of large 
numbers, proved in [D2] along with a corresponding extension of the 
strong law. 

If the first n terms of a symmetric sequence are observed, how does 
the rest of the sequence appear to the person in the light of this obser- 
vation? In the first place, it also is a symmetric sequence but generally 
of a structure different from that of the original sequence, as may be 
shown thus: Let 


(8) $\pi(y, n - y) =_{\mathrm{Df}} P(x_r(s) = x_r;\ r = 1, \dots, n),$ 

as one may for a symmetric sequence. Then 

(9) $P(x_q(s) = x_q;\ q = n + 1, \dots, n + m \mid x_r(s) = x_r;\ r = 1, \dots, n) = \dfrac{P(x_p(s) = x_p;\ p = 1, \dots, n + m)}{P(x_r(s) = x_r;\ r = 1, \dots, n)} = \dfrac{\pi(y + z, (n - y) + (m - z))}{\pi(y, n - y)},$ 


where z is the number of 1’s among the $x_q$’s, q = n + 1, ..., n + m. 
Equation (9) shows that the sequence $x_q$, q > n, given that $x_r(s) = x_r$, 
r = 1, ..., n, is a new symmetric sequence characterized by 


(10) $\pi'(z, m - z) =_{\mathrm{Df}} \dfrac{\pi(y + z, (n - y) + (m - z))}{\pi(y, n - y)}.$ 


The measure M’ associated with the new sequence is, according to 
Theorem 1, essentially determined by the condition that 


(11) $\int p^m\,dM'(p) = \pi'(m, 0) = \dfrac{\pi(m + y, n - y)}{\pi(y, n - y)} = \dfrac{\int p^{m+y} (1 - p)^{n-y}\,dM(p)}{\pi(y, n - y)} = \int p^m\,\dfrac{p^y (1 - p)^{n-y}}{\pi(y, n - y)}\,dM(p).$ 

Equation (11) makes it plausible that, except for the slight ambiguity 
permitted by Theorem 1, M’ is defined (for Borel sets B) by 


(12) $M'(B) = \pi^{-1}(y, n - y) \int_B p^y (1 - p)^{n-y}\,dM(p),$ 


and this can in fact be demonstrated with some appeal to slightly ad- 
vanced methods pertaining to the Hausdorff moment problem (pp. 8-9 
of [S13]). 

It is noteworthy that, if M(B) = 0, then M’(B) = 0 also. In the 
event that p really is an unknown parameter, this means that, if the 
person is virtually certain that the true p is not in B, no amount of 
evidence can alter that opinion. 

Equation (12) shows that M’ is generally different from M. Indeed, 
for fixed n > 1, M’ is clearly the same as M for every y for which 
$\pi(y, n - y) > 0$, if and only if M assigns the measure 1 to some one 
value of p. That is, the person regards evidence drawn from a sym- 
metric sequence as irrelevant to the future behavior of the sequence, if 
and only if at the outset he regards the sequence not merely as sym- 
metric but also as independent. 
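
When M is concentrated on finitely many values of p, equation (12) becomes a simple reweighting, which the following sketch carries out. The prior measure, the observed counts, and the function names pi and update are assumptions made for illustration; the code also checks the moment condition (11), the preservation of zero weights, and the invariance of a point mass.

```python
from fractions import Fraction

def pi(M, y, n_minus_y):
    """pi(y, n - y): probability of one particular pattern with y ones and n - y zeros."""
    return sum(w * p ** y * (1 - p) ** n_minus_y for p, w in M.items())

def update(M, y, n):
    """Equation (12) for a discrete M: reweight each p by p**y (1-p)**(n-y) / pi(y, n-y)."""
    norm = pi(M, y, n - y)
    if norm == 0:
        raise ValueError("the observed pattern has probability zero under M")
    return {p: w * p ** y * (1 - p) ** (n - y) / norm for p, w in M.items()}

# Hypothetical prior measure M on three values of p; the value 1/2 has weight zero.
M = {Fraction(1, 4): Fraction(1, 2),
     Fraction(1, 2): Fraction(0),
     Fraction(3, 4): Fraction(1, 2)}

M_prime = update(M, y=3, n=4)          # three ones observed in four trials

# Equation (11) with m = 2: the m-th moment of M' equals pi(m + y, n - y) / pi(y, n - y).
m, y, n = 2, 3, 4
lhs = sum(w * p ** m for p, w in M_prime.items())
rhs = pi(M, m + y, n - y) / pi(M, y, n - y)

# A point mass is left unchanged by any observation of positive probability.
point_mass = {Fraction(1, 3): Fraction(1)}
unchanged = update(point_mass, y=3, n=4) == point_mass
```

Here the updated weights are 1/10 on p = 1/4 and 9/10 on p = 3/4, while the value p = 1/2, which had weight zero, still has weight zero.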

It can be shown that the person regards it as highly probable that, 
if he observes a sufficiently long segment of a symmetric sequence, the 
continuation of the sequence will then be one for which the conditional 
variance of p, 


(13) $\int p^2\,dM'(p) - \left\{\int p\,dM'(p)\right\}^2$ 


will be small. In the event that p is really an unknown parameter, this 
implies that the person is very sure that after a long sequence of obser- 
vations he will assign nearly unit probability to the immediate neigh- 
borhood of the value of p that actually obtains—a parallel to the ap- 
proach to certainty discussed in § 6. 
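
The shrinking of the conditional variance can be illustrated, though not proved, by an exact computation for a discrete measure. The sketch below assumes a hypothetical M with equal weight on three values of p and computes, before any observation, the average of the conditional variance (13) over the possible numbers y of 1's in n trials.

```python
from fractions import Fraction
from math import comb

# Hypothetical prior measure M on a few values of p.
M = {Fraction(1, 5): Fraction(1, 3),
     Fraction(1, 2): Fraction(1, 3),
     Fraction(4, 5): Fraction(1, 3)}

def posterior(y, n):
    """Discrete form of equation (12): M' after observing y ones in n trials."""
    weights = {p: w * p ** y * (1 - p) ** (n - y) for p, w in M.items()}
    norm = sum(weights.values())
    return {p: v / norm for p, v in weights.items()}

def variance(measure):
    """Expression (13): the second moment of p minus the square of its mean."""
    mean = sum(w * p for p, w in measure.items())
    return sum(w * p ** 2 for p, w in measure.items()) - mean ** 2

def expected_posterior_variance(n):
    """Average of the conditional variance of p over the possible counts y."""
    total = Fraction(0)
    for y in range(n + 1):
        prob_y = comb(n, y) * sum(w * p ** y * (1 - p) ** (n - y) for p, w in M.items())
        total += prob_y * variance(posterior(y, n))
    return total

values = [float(expected_posterior_variance(n)) for n in (0, 5, 25, 100)]
```

With no observations the value is the prior variance 0.06; it decreases as n grows. A small average of a nonnegative quantity implies that the quantity is small with high probability, which is the sense of the statement above.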


CHAPTER 4 


Critical Comments 


on Personal Probability 


1 Introduction 


It is my tentative view that the concept of personal probability in- 
troduced and illustrated in the preceding chapter is, except possibly 
for slight modifications, the only probability concept essential to sci- 
ence and other activities that call upon probability. I propose in this 
chapter to discuss the shortcomings I see in that particular personal- 
istic view of probability, which, for brevity, shall here be called simply 
“the personalistic view”; to point out briefly the relationships between 
it and other views; to criticize other views in the light of it; and to dis- 
cuss the criticisms holders of other views have raised, or may be ex- 
pected to raise, against it. 

From the standpoint of strict logical organization such critical re- 
marks are somewhat premature, because the personalistic view itself 
insists that probability is concerned with consistent action in the face 
of uncertainty. Consequently, until the theory of such action has been 
completely outlined in later chapters, the view to be criticized cannot 
even be considered to have been wholly presented. Practically, how- 
ever, it seems wise not to confine critical comments to the one part of 
the text that logic may suggest as appropriate, but rather to touch on 
criticism from time to time, even at the cost of some repetition. Thus, 
some of what is to be said here has already been said in the introductory 
chapter and elsewhere, and some of it will be said again. 

Views other than the personalistic view are to be discussed here, but 
it cannot be too distinctly emphasized that the account given of them 
will be very superficial.† One function of discussing other views is to 
provide the reader with at least some orientation in the large and 
diversified body of ideas pertaining to the foundation of statistics that 
have been accumulated. A less obvious, but I think no less important 
and legitimate, function is to cast new light on the personalistic view, 
especially for those who already hold, or tend to hold, other views. 

† Much more extensive comparative material is given by Keynes [K4], by Nagel 
[N1], and by Carnap [C1]. Koopman [K12] should also be mentioned in this 
connection. 


2 Some shortcomings of the personalistic view 


I can answer, to my own satisfaction, some criticisms of the personal- 
istic view that have been brought to my attention. These points are 
discussed later in the chapter, but in this section I state and discuss 
as clearly as I can those that I find more difficult and confusing to 
answer. 

According to the personalistic view, the role of the mathematical 
theory of probability is to enable the person using it to detect incon- 
sistencies in his own real or envisaged behavior. It is also understood 
that, having detected an inconsistency, he will remove it. An incon- 
sistency is typically removable in many different ways, among which 
the theory gives no guidance for choosing. Silence on this point does 
not seem altogether appropriate, so there may be room to improve the 
theory here. Consider an example: The person finds on interrogating 
himself about the possible outcome of tossing a particular coin five 
times that he considers each of the thirty-two possibilities equally 
probable, so each has for him the numerical probability 1/32. He also 
finds that he considers it more probable that there will be four or five 
heads in the five tosses than that the first two tosses will both be heads. 
Now, reference to the mathematical theory of probability soon shows 
the person that, if the probability of each of the thirty-two possibilities 
is 1/32, then the probability of four or five heads out of five is 6/32, 
and the probability that the first two tosses will be heads is 8/32, so 
the person has caught himself in an inconsistency. The theory does not 
tell him how to resolve the inconsistency; there are literally an infinite 
number of possibilities among which he must choose. 
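
The arithmetic of the example can be confirmed by enumerating the thirty-two outcomes. This is a minimal sketch; the helper P and the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All thirty-two outcomes of five tosses, each given probability 1/32.
outcomes = list(product("HT", repeat=5))
prob = Fraction(1, len(outcomes))

def P(event):
    """Probability of the set of outcomes satisfying the predicate `event`."""
    return sum(prob for o in outcomes if event(o))

four_or_five_heads = P(lambda o: o.count("H") >= 4)          # 6/32
first_two_heads = P(lambda o: o[0] == "H" and o[1] == "H")   # 8/32

# The equal-probability judgment contradicts the judgment that four or
# five heads is the more probable of the two events.
inconsistent = not (four_or_five_heads > first_two_heads)
```

The enumeration only detects the inconsistency; as the text says, it gives no guidance on which of the two judgments to give up.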

In this particular example, the choice that first comes to my mind, 
and I imagine to yours, is to hold fast to the position that all thirty-two 
possibilities are equally likely and to accept the implications of that 
position, including the implication that four or five heads out of five 
is less probable than two heads out of two. I do not think that there is 
any justification for that choice implicit in the example as formally 
stated, but rather that in the sort of actual situation of which the ex- 
ample is a crude schematization there generally are considerations not 
incorporated in the example that do justify, or at any rate elicit, the 
choice. 

To approach the matter in a somewhat different way, there seem to 
be some probability relations about which we feel relatively “sure” as 
compared with others. When our opinions, as reflected in real or 
envisaged action, are inconsistent, we sacrifice the unsure opinions to the 
sure ones. The notion of “sure” and “unsure” introduced here is vague, 
and my complaint is precisely that neither the theory of personal 
probability, as it is developed in this book, nor any other device known to me 
renders the notion less vague. There is some temptation to introduce 
probabilities of a second order so that the person would find himself 
saying such things as “the probability that B is more probable than C 
is greater than the probability that F is more probable than G.” But 
such a program seems to meet insurmountable difficulties. 

The first of these—pointed out to me by Max Woodbury—is this. 
If the primary probability of an event B were a random variable b 
with respect to secondary probability, then B would have a “composite” 
probability, by which I mean the (secondary) expectation of b. Com- 
posite probability would then play the allegedly villainous role that 
secondary probability was intended to obviate, and nothing would have 
been accomplished. 
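
The objection can be made concrete with a small computation. This is a sketch under stated assumptions: the secondary distribution below is hypothetical and chosen only for illustration, and the name `secondary` is not from the text.

```python
from fractions import Fraction

# Hypothetical secondary distribution: possible values b of the primary
# probability of B, each paired with its secondary probability.
secondary = {
    Fraction(1, 4): Fraction(1, 5),
    Fraction(1, 2): Fraction(3, 5),
    Fraction(3, 4): Fraction(1, 5),
}
assert sum(secondary.values()) == 1

# Composite probability of B: the (secondary) expectation of b.
composite = sum(b * weight for b, weight in secondary.items())
print(composite)  # 1/2
```

However the secondary distribution is spread out, the computation ends in a single number attached to B, which is the role that second-order probabilities were meant to avoid.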

Again, once second order probabilities are introduced, the introduc- 
tion of an endless hierarchy seems inescapable. Such a hierarchy seems 
very difficult to interpret, and it seems at best to make the theory less 
realistic, not more. 

Finally, the objection concerning composite probability would seem 
to apply, even if an endless hierarchy of higher order probabilities were 
introduced. The composite probability of B would here be the limit 
of a sequence of numbers, $E_n(E_{n-1}(\cdots E_2(P_1(B)) \cdots))$, a limit that 
could scarcely be postulated not to exist in any interpretable theory of 
this sort. The reader may wish to evaluate for himself the arguments 
in favor of such a hierarchy put forward by Reichenbach (Chapter 8, 
[R2]), taking proper account of the differences between Reichenbach’s 
overall view, and his mathematical theory, of probability on one hand 
and, on the other, the personalistic view and measure-theoretic mathe- 
matical theory that are the basis of my critique of higher order proba- 
bilities. 

The interplay between the “sure” and “unsure” is interestingly 
expressed by de Finetti (p. 60, [D2]) thus: “The fact that a direct estimate 
of a probability is not always possible is just the reason that the logi- 
cal rules of probability are useful. The practical object of these rules 
is simply to reduce an evaluation, scarcely accessible directly, to others 
by means of which the determination is rendered easier and more 
precise.” 

It may be clarifying, especially for some readers under the sway of 
the objectivistic tradition, to mention that, if a person is “sure” that 
the probability of heads on the first toss of a certain penny is 1/2, it does 
not at all follow that he considers the coin fair. He might, to take an 
extreme example, be convinced that the penny is a trick one that 
always falls heads or always falls tails. 

+ One tempting representation of the unsure is to replace the person’s single 
probability measure P by a set of such measures, especially a convex set. Some 
explorations of this are Dempster (1968), Good (1962), and Smith (1961). 

Logic, to which the theory of personal probability can be closely par- 
alleled, is similarly incomplete. Thus, if my beliefs are inconsistent 
with each other, logic insists that I amend them, without telling me how 
to do so. This is not a derogatory criticism of logic but simply a part 
of the truism that logic alone is not a complete guide to life. Since the 
theory of personal probability is more complete than logic in some re- 
spects, it may be somewhat disappointing to find that it represents no 
improvement in the particular direction now in question. 

A second difficulty, perhaps closely associated with the first one, 
stems from the vagueness associated with judgments of the magnitude 
of personal probability. The postulates of personal probability imply 
that I can determine, to any degree of accuracy whatsoever, the proba- 
bility (for me) that the next president will be a Democrat. Now, it is 
manifest that I cannot really determine that number with great 
accuracy, but only roughly. Since, as is widely recognized, all the 
interesting and useful theories of modern science, for example, geometry, 
relativity, quantum mechanics, Mendelism, and the theory of perfect 
competition, are inexact, it may not at first sight seem disquieting that the 
theory of personal probability should also be somewhat inexact. As 
will immediately be explained, however, the theory of personal proba- 
bility cannot safely be compared with ordinary scientific theories in 
this respect. 

I am not familiar with any serious analysis of the notion that a theory 
is only slightly inexact or is almost true, though philosophers of science 
have perhaps presented some. Even if valid analyses of the notion 
have been made, or are made in the future, for the ordinary theories of 
science, it is not to be expected that those analyses will be immediately 
applicable to the theory of personal probability, normatively inter- 
preted; because that theory is a code of consistency for the person ap- 
plying it, not a system of predictions about the world around him. 

The difficulty experienced in § 2.6 with defining indifference seems 
closely associated with the difficulty about vagueness raised here. 

Another difficulty with the theory of personal probability (or, more 
properly, with that larger theory of the behavior of a person in the 
face of uncertainty, of which the theory of personal probability is a 
part) is that the statement of the theory is not yet necessarily complete. 
Thus we shall in the next chapter come upon another proposition that 
demands acceptance as a postulate, and, since even this leaves the 
person a great deal of freedom, there is no telling when someone will come 
upon still another postulate that clamors to be adjoined to the others. 
Strictly speaking, this is not so much an objection to the theory as a 
warning about what to expect of its future development. 


3 Connection with other views 


All views of probability are rather intimately connected with one an- 
other. For example, any necessary view can be regarded as an extreme 
personalistic view in which so many criteria of consistency have been 
invoked that there is no role left for the person’s individual judgment. 
Again, objectivistic views can be regarded as personalistic views ac- 
cording to which comparisons of probability can be made only for very 
special pairs of events, and then only according to such criteria that all 
(right-minded) people agree in their comparisons. 

From a different standpoint, personalistic views lie not between, but 
beside, necessary and objectivistic views; for both necessary and objec- 
tivistic views may, in contrast to personalistic views, be called objective 
in that they do not concern individual judgment. 


4 Criticism of other views 


It will throw some light on the personalistic view to say briefly how 
some other views seem to compare unfavorably with it. 

It is one of my fundamental tenets that any satisfactory account of 
probability must deal with the problem of action in the face of uncer- 
tainty. Indeed, almost everyone who seriously considers probability, 
especially if he has practical experience with statistics, does sooner or 
later deal with that problem, though often only tacitly. Even some 
personalistic views seem to me too remote from the problem of action, 
or decision. For example, de Finetti in [D2] gives two approaches to 
personal probability. Of these, one is almost exactly like the view 
sponsored here, except only that the notion “more probable than” is 
supposed to be intuitively evident to the person, without reference to 
any problem of decision. The other is more satisfactory in this re- 
spect, being couched in terms of betting behavior, but it seems to me 
a somewhat less satisfactory approach than the one sponsored here, be- 
cause it must assume either that the bets are for infinitesimal sums or— 
anticipating the language of the next chapter—that the utility of money 
is linear. The theory expressed by Koopman in [K9], [K10], and [K11] 
and that expressed by Good in [G2] are both personalistic views that 
tend to ignore decision, or at any rate keep it out of the foreground; 
but the personalistic view expressed by Ramsey in [R1], like the one 
sponsored here, takes decision as fundamental. If any necessary view 
can be formulated at all, it might well be possible to formulate it in 
terms of decision, but, so far as I know, the notion of decision has not 
appeared fundamental to the holders of any necessary view. It seems 
fair to say that objectivistic views, by their very nature, must in prin- 
ciple regard decision as secondary to probability, if relevant at all. 
Yet, the objectivist A. Wald has done more than anyone else to popu- 
larize the notion of decision. 

As has already been indicated, from the position of the personalistic 
view, there is no fundamental objection to the possibility of construct- 
ing a necessary view, but it is my impression that that possibility has 
not yet been realized, and, though unable to verbalize reasons, I con- 
jecture that the possibility is not real. Two of the most prominent en- 
thusiasts of necessary views are Keynes, represented by [K4], and Car- 
nap, who has begun in [C1] to state what he hopes will prove a satis- 
factory necessary (or nearly necessary) view of probability. Keynes 
indicated in the closing pages of [K4] that he was not fully satisfied 
that he had solved his problem and even suggested that some element 
of objectivistic views might have to be accepted to achieve a satisfac- 
tory theory, and Carnap regards [C1] as only a step toward the estab- 
lishment of a satisfactory necessary view, in the existence of which he 
declares confidence. That these men express any doubt at all about the 
possibility of narrowing a personalistic view to the point where it be- 
comes a necessary one, after such extensive and careful labor directed 
toward proving this possibility, speaks loudly for their integrity; at the 
same time it indicates that the task they have set themselves, if possi- 
ble at all, is not a light one. 

Keynes, writing in 1921 of what are here called objectivistic views, 
complained, “The absence of a recent exposition of the logical basis of 
the frequency theory by any of its adherents has been a great disadvan- 
tage to me in criticizing it.” (Chap. VIII, Sec. 17, of [K4]). I believe 
that his complaint applies as aptly to my position today as to his then, 
though I cannot pretend to have combed the intervening literature 
with anything like the thoroughness Keynes himself would have em- 
ployed. Reichenbach, to be sure, presents in great detail an interest- 
ing view that must be classified as objectivistic [R2], but it seems far 
removed from those that dominate modern statistical theory and form 
the main subject of the following discussion. Whatever objectivistic 
views may be, they seem, to holders of necessary and personalistic 
views alike, subject to two major lines of criticism. In the first place, 
objectivistic views typically attach probability only to very special 
events. Thus, on no ordinary objectivistic view would it be meaning- 
ful, let alone true, to say that on the basis of the available evidence it 
is very improbable, though not impossible, that France will become a 
monarchy within the next decade. Many who hold objectivistic views 
admit that such everyday statements may have a meaning, but they 
insist, depending on the extremity of their positions, that that meaning 
is not relevant to mathematical concepts of probability or even to sci- 
ence generally. The personalistic view claims, however, to analyze 
such statements in terms of mathematical probability, and it considers 
them important in science and other human activities. 

Secondly, objectivistic views are, and I think fairly, charged with 
circularity. They are generally predicated on the existence in nature 
of processes that may, to a sufficient degree of approximation, be rep- 
resented by a purely mathematical object, namely an infinite sequence 
of independent events. This idealization is said, by the objectivists 
who rely on it, to be analogous to the treatment of the vague and ex- 
tended mark of a carpenter’s pencil as a geometrical point, which is so 
fruitful in certain contexts. When it is pointed out to the objectivist 
that he uses the very theory of probability in determining the quality 
of the approximation to which he refers, he retorts that the applied 
geometer—a fictitious character whose reputation for solidity in science 
is unquestioned—likewise uses geometry in determining the quality of 
his approximations. Let the geometer then be challenged, and he re- 
plies with a threefold reference to experience, saying, “It is a common 
experience that with sufficient experience one develops good judgment 
in the use of geometry and thenceforth generally experiences success in 
the predictions he bases on it.” “Now,” says the objectivist, “the 
geometer’s answer is my answer.” But it seems to critics of objectivistic 
views that, though the geometer may be entitled to make as many allu- 
sions to experience as he pleases, the probabilist is not free to do so, 
precisely because it is the business of the probabilist to analyze the con- 
cept of experience. He, therefore, cannot properly support his position 
by alluding to experience until he has analyzed that concept, though 
he can, of course, allude to as many experiences as he wishes. 

Two sorts of mixed views call for special comment here. 

First, some (among them Carnap [C1]; Koopman [K9], [K10], and 
[K11]; and Nagel [N1]) hold that two probability concepts play a role 
in inference, an objectivistic one and a personalistic or a necessary one. 
This dualism is typically justified as necessary to the analysis of such 
a concept as that of a coin with unknown probability of falling heads. 
But, as § 3.7 explains, de Finetti has provided a satisfactory analysis 
on the basis of personal probability alone. 

Second, others—for example, van Dantzig [V1] and Féraud [F2]— 
finding the conventional objectivistic views circular for the reasons I 
have cited, try to break the circle by relatively isolated use of 
subjective ideas. Very crudely, it seems to be their position that in any one 
context it is allowable for a person to act as though some one event of 
sufficiently small (objective) probability, chosen at his discretion, were 
impossible. Quite apart from the relatively technical question of 
whether any consistent mixed view of this kind can be constructed, 
holders of personalistic and necessary views alike criticize them as un- 
necessarily timid, for they embrace subjective ideas, but only gingerly. 


5 The role of symmetry in probability 


An important and highly controversial question in the foundations 
of probability is whether and, if so, how symmetry considerations can 
determine the probabilities of at least some events. 

Symmetry considerations have always been important in the study 
of probability. Indeed, early work in probability was dominated by 
the notion of symmetry, for it was usually either concerned with, or di- 
rectly inspired by, symmetrical gambling apparatus such as dice or 
cards. To illustrate those classical problems, suppose that a gambler is 
offered several bets concerning the possible outcome of rolling three 
dice, where it is to be understood that refraining from any bets at all 
may be among the available “bets.” Which of the available bets 
should the gambler choose? Perhaps I distort history somewhat in in- 
sisting that early problems were framed in terms of choice among bets, 
for many, if not most, of them were framed in terms of equity, that is, 
they asked which of two players, if either, would have the advantage 
in a hypothetical bet. But, especially from the point of view of the 
earlier probabilists, such a question of equity is tantamount to a ques- 
tion of choice among bets, for to ask which of two “equal” betters has 
the advantage is to ask which of them has the preferable alternative, 
as was pointed out quite explicitly by D. Bernoulli in [B10]. 

In effect, the classical workers recommended the following solution 
to the problem of three dice, with corresponding solutions to other 
gambling problems: 

1. Attach equal mathematical probabilities to each of the 216 (= $6^3$) 
possible outcomes of rolling the three dice. (There are $6^3$ possibilities, 
because the first, second, and third dice can each show any of six scores, 
all combinations being possible.) 

2. Under the mathematical probability established in Step 1, com- 
pute the expected winnings (possibly negative) of the gambler for each 
available bet. 

3. Choose a bet that has the largest expected winnings among those 
available. 

At present it is appropriate to refrain from criticisms of the use 
made of expected winnings until the next chapter and to concentrate 
discussion on the notion that the 216 possibilities should be considered 
equally probable, which can conveniently be done by drastically reduc- 
ing the class of bets considered to be available. Say, for definiteness, 
that the only bets to be considered are simply even-money bets of one 
dollar, that the triple of scores falls in a preassigned subset of the 216 
possibilities. When attention is focused on this restricted class of bets, 
the total recommendation is seen to imply that the probability measure 
defined in the first step of the recommendation be adopted as the per- 
sonal probability of the gambler. To put it differently, a gambler who 
adopts the recommendation will hold the 216 possible outcomes equally 
probable not only in some abstract sense, but also in the sense of per- 
sonal probability as defined in § 3.2. 
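
The classical three-step recommendation, applied to the restricted class of even-money one-dollar bets just described, can be sketched as follows. This is an illustration under stated assumptions only: the menu of bets in `bets` and all names are hypothetical, and "no bet" is modelled as an alternative with expected winnings zero.

```python
from fractions import Fraction
from itertools import product

# Step 1: equal probability for each of the 6**3 = 216 ordered triples of scores.
outcomes = list(product(range(1, 7), repeat=3))
p_each = Fraction(1, len(outcomes))

def expected_winnings(subset, stake=1):
    """Step 2: expected winnings of an even-money bet that the triple falls in `subset`."""
    p_win = p_each * sum(1 for w in outcomes if w in subset)
    return stake * p_win - stake * (1 - p_win)

# Hypothetical menu of bets, each identified with a preassigned subset of outcomes.
bets = {
    "sum is at least 11": {w for w in outcomes if sum(w) >= 11},
    "at least one six": {w for w in outcomes if 6 in w},
    "all three scores differ": {w for w in outcomes if len(set(w)) == 3},
}
expectations = {name: expected_winnings(s) for name, s in bets.items()}
expectations["no bet"] = Fraction(0)

# Step 3: choose a bet with the largest expected winnings.
best = max(expectations, key=expectations.get)
print(best, expectations[best])  # all three scores differ 1/9
```

For an even-money bet at a fixed stake the expected winnings are an increasing function of the probability of winning (here 2p - 1), so ranking these bets by expected winnings is the same as ranking their subsets by probability. That is the sense in which adopting the recommendation amounts to adopting the uniform measure as one's personal probability.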

The notion that the 216 possibilities should be regarded as equally 
probable is familiar to everyone; for it is taken for granted wherever 
gentlemen gamble as well as in the standard high-school algebra courses, 
where it serves to illustrate the theory of combinations and permutations. 

Traditionally, the equality of the probabilities was supposed to be 
established by what was called the principle of insufficient reason,† 
thus: Suppose that there is an argument leading to the conclusion that 
one of the possible combinations of ordered scores, say {1, 2, 3}, is 
more probable than some other, say {6, 3, 4}. Then the information 
on which that hypothetical argument is based has such symmetry as 
to permit a completely parallel, and therefore equally valid, argument 
leading to the conclusion that {6, 3, 4} is more probable than {1, 2, 3}. 
Therefore, it was asserted, the probabilities of all combinations must 
be equal. 

The principle of insufficient reason has been and, I think, will con- 
tinue to be a most fertile idea in the theory of probability; but it is not 
so simple as it may appear at first sight, and criticism has frequently 
and justly been brought against it. Holders of necessary views typi- 
cally attempt to put the principle on a rigorous basis by modifying it 
in such a way as to take account of such criticism. Holders of personal- 
istic and objectivistic views typically regard the criticism as not alto- 
gether refutable, so they do not attempt to establish a formal postulate 
corresponding to the principle but content themselves—as I shall here 
—with exhibiting an element of truth in it. 

One of the first criticisms is that the principle is not strictly applicable 
for a person who has had any experience with the apparatus in 
question, or even with similar apparatus. Thus, attempts to use the 
principle, as I have stated it, to prove that there is no such thing as a run 
of luck at dice, as actually played, are invalid. The person may have 
had relevant experience, directly or vicariously, not only with gambling 
apparatus itself, but also with people who make and handle it, including 
cheaters. 

† Perhaps what I here call the principle of insufficient reason should be called the 
principle of cogent reason. See Section 3 of [B15] for the distinction involved. 

It is not always obvious what the symmetry of the information is in 
a situation in which one wishes to invoke the principle of insufficient 
reason. For example, d’Alembert, an otherwise great eighteenth-cen- 
tury mathematician, is supposed to have argued seriously that the prob- 
ability of obtaining at least one head in two tosses of a fair coin is 2/3 
rather than 3/4. (Cf. [T3], Art. 464.) Heads, as he said, might appear 
on the first toss, or, failing that, it might appear on the second, or, 
finally, might not appear on either. D’Alembert considered the three 
possibilities equally likely. 
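
The two competing counts can be set side by side. A minimal sketch, with illustrative names that are not from the text; it assumes the usual assignment of equal probability to the four ordered outcomes of two tosses.

```python
from fractions import Fraction
from itertools import product

# Four equally probable ordered outcomes of two tosses of a fair coin.
outcomes = list(product("HT", repeat=2))
p_uniform = Fraction(sum(1 for w in outcomes if "H" in w), len(outcomes))

# d'Alembert's three cases, expressed as sets of ordered outcomes.
cases = {
    "head on first toss": [w for w in outcomes if w[0] == "H"],
    "head only on second toss": [w for w in outcomes if w == ("T", "H")],
    "no head": [w for w in outcomes if "H" not in w],
}
# Treating the three cases as equally likely, two of the three show a head.
p_dalembert = Fraction(2, 3)

# Under the four-outcome assignment the three cases are not equally probable.
case_probs = {name: Fraction(len(ws), len(outcomes)) for name, ws in cases.items()}
print(p_uniform, p_dalembert, case_probs)
```

The first of d'Alembert's cases lumps together two of the four ordered outcomes, which is why it carries probability 1/2 rather than 1/3 under the usual assignment.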

It seems reasonable to suppose that, if the principle of insufficient 
reason were formulated and applied with sufficient care, the conclusion 
of d’Alembert would appear simply as a mistake. There are, however, 
more serious examples. Suppose, to take a famous one, that it is known 
of an urn only that it contains either two white balls, two black balls, 
or a white ball and a black ball. The principle of insufficient reason has 
been invoked to conclude that the three possibilities are equally proba- 
ble, so that in particular the probability of one white and one black 
ball is concluded to be 1/3. But the principle has also been applied to 
conclude that there are four equally probable possibilities, namely, that 
the first ball is white and the second also, that the first is white and the 
second black, etc. On that basis, the probability of one white and one 
black ball is, of course, 1/2. Personally, I do not try to arbitrate be- 
tween the two conclusions but consider that the existence of the pair 
of them reflects doubt on the notion that a person’s knowledge relevant 
to any matter admits any full and precise description in terms of 
propositions he knows to be true and others about which he knows 
nothing. 
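
The two applications of the principle to the urn can be written out explicitly. This is only a sketch of the two descriptions given above, with illustrative names; it does not arbitrate between them.

```python
from fractions import Fraction
from itertools import product

# Description A: three equally probable compositions of the urn.
compositions = ["two white", "two black", "one white and one black"]
p_mixed_a = Fraction(compositions.count("one white and one black"), len(compositions))

# Description B: four equally probable colourings of (first ball, second ball).
colourings = list(product(["white", "black"], repeat=2))
p_mixed_b = Fraction(sum(1 for c in colourings if c[0] != c[1]), len(colourings))

print(p_mixed_a, p_mixed_b)  # 1/3 1/2
```

The same state of "knowing nothing further" yields different probabilities depending on which partition is taken to be symmetric.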

Most holders of personalistic views do not find the principle of in- 
sufficient reason compelling, because they envisage the possibility that 
a person may consider one event more probable than another without 
having any compelling argument for his attitude. Viewed practically, 
this position is closely associated with the first criticism of the principle 
of insufficient reason, for the holder of a personalistic view typically 
supposes that the person is under the influence of experience, and pos- 
sibly even biologically determined inheritance, that expresses itself in 
his opinions, though not necessarily through compelling argument. 

Holders of personalistic views do see some truth in the principle of 
insufficient reason, because they recognize that there are frequently par- 
titions of the world, associated with symmetrical-looking gambling ap- 
paratus and the like, that many and diverse people all consider (very 
nearly) uniform partitions. As was illustrated in the preceding sec- 
tion, we often feel more “sure” about probabilities derived from the 
judgment that such partitions are uniform than we do about others. 
Such partitions are, moreover, very important in that they provide 
some events the probability of which to diverse people is in agreement. 
Though the events concerned are often of no importance in themselves, 
agreement about them can, through the statistical invention of ran- 
domization, contribute to agreement about all sorts of issues open to 
empirical investigation. Widespread though the agreement about the 
near uniformity of some partitions is, holders of personalistic views 
typically do not find the contexts in which such agreement obtains 
sufficiently definable to admit of expression in a postulate. 

Holders of purely objectivistic views see no sense at all in the original 
formulation of the principle of insufficient reason, for it uses 
“probability” in a manner they consider meaningless. But they too see an 
element of truth in the principle, which they consider to be established 
as a part of empirical physics. Thus, for example, they regard it as an 
experimental fact, admitting some explanation in terms of theoretical 
physics, that three dice manufactured with reasonable symmetry will 
exhibit each of the 216 possible patterns with nearly equal frequency, 
if repeatedly rolled with sufficient violence on a suitable surface. 

Holders of personalistic views agree that experiments or, more gen- 
erally, experiences determine to a large extent when people employ the 
idea of insufficient reason. Thus, though experiments with gambling 
apparatus, quite apart from gambling itself, have a fascination that 
perhaps exceeds their real interest, such experiments are not altogether 
worthless. On the one hand, they provide strong evidence that a per- 
son cannot expect to maintain a symmetrical attitude toward any piece 
of apparatus with which he has had long experience, unless he is vir- 
tually convinced at the outset that the possible states of the apparatus 
are equally probable and independent from trial to trial. To say it in 
the more familiar and sometimes more congenial language of objective 
probability, long experiments with coins, dice, cards, and the like have 
always shown some bias, and often some dependence from trial to trial. 
On the other hand (and this has the utmost practical importance), it 
has been shown that, with skill and experience, gambling apparatus, or 
its statistical equivalent, can be manufactured in which the bias and 
the dependence from trial to trial are extremely small. This implies 
that groups of very diverse people can be brought to agree that repeated 
trials with certain apparatus are nearly uniform and nearly independent. 
Thus certain methods of obtaining random numbers and other outcomes 
of uniform and independent trials, which are vital to many sorts of 
experimentation, have justifiably found acceptance with the scientific 
public. A stimulating account of practical methods of obtaining ran- 
dom numbers, and random samples generally, is given by Kendall in 
Chapter 8 (Vol. I) of [K2]. 


6 How can science use a personalistic view of probability? 


It is often argued by holders of necessary and objectivistic views alike 
that that ill-defined activity known as science or scientific method con- 
sists largely, if not exclusively, in finding out what is probably true, 
by criteria on which all reasonable men agree. The theory of proba- 
bility relevant to science, they therefore argue, ought to be a codifica- 
tion of universally acceptable criteria. Holders of necessary views say 
that, just as there is no room for dispute as to whether one proposition 
is logically implied by others, there can be no dispute as to the extent 
to which one proposition is partially implied by others that are thought 
of as evidence bearing on it, for the exponents of necessary views re- 
gard probability as a generalization of implication. Holders of objec- 
tivistic views say that, after appropriate observations, two reasonable 
people can no more disagree about the probability with which trials 
in a sequence of coin tosses are heads than they can disagree about the 
length of a stick after measuring it by suitable methods, for they con- 
sider probability an objective property of certain physical systems in 
the same sense that length is generally considered an objective property 
of other physical systems, small errors of measurement being contem- 
plated in both contexts. Neither the necessary nor the objectivistic 
outlook leaves any room for personal differences; both, therefore, look 
on any personalistic view of probability as, at best, an attempt to pre- 
dict some of the behavior of abnormal, or at any rate unscientific, 
people. 

I would reply that the personalistic view incorporates all the 
universally acceptable criteria for reasonableness in judgment known to me 
and that, when any criteria that may have been overlooked are brought 
forward, they will be welcomed into the personalistic view. The 
criteria incorporated in the personalistic view do not guarantee agreement 
on all questions among all honest and freely communicating people, 
even in principle. That incompleteness, if one will call it such, does not 
distress me, for I think that at least some of the disagreement we see 
around us is due neither to dishonesty, to errors in reasoning, nor to 
friction in communication, though the harmful effects of the latter are 
almost incapable of exaggeration. 

As was mentioned in connection with symmetry, there are partitions 
that diverse people all consider nearly uniform, though not compelled 
to that agreement by any postulate of the theory of personal proba- 
bility. As has also been mentioned and as will be explained later (es- 
pecially in § 14.8), through the statistical invention of randomization, 
agreement about partitions pertaining to gambling apparatus of no im- 
portance in itself can be made to contribute to agreement in every 
part of empirical science. 

Another mechanism that brings people having some, but not all, 
opinions in common into more complete agreement was illustrated in 
§§ 3.6-7. Indeed, it was there shown that in certain contexts any two 
opinions, provided that neither is extreme in a technical sense, are al- 
most sure to be brought very close to one another by a sufficiently 
large body of evidence. 
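
The way shared evidence draws opinions together can be illustrated numerically. The sketch below is an assumption-laden example rather than the construction of §§ 3.6-7: it supposes two people whose opinions about a coin's chance of heads are Beta(a, b) distributions, and a hypothetical data set in which 60% of the tosses fall heads.

```python
from fractions import Fraction

def posterior_mean(a, b, heads, tosses):
    """Mean of the Beta(a + heads, b + tosses - heads) posterior for the chance of heads."""
    return Fraction(a + heads, a + b + tosses)

# Two hypothetical people with different, non-extreme Beta(a, b) prior opinions.
priors = {"first person": (1, 4), "second person": (6, 2)}

def gap_after(tosses):
    """Distance between the two posterior means after `tosses` shared observations."""
    heads = (3 * tosses) // 5  # illustrative data: 60% of the tosses fall heads
    first, second = (posterior_mean(a, b, heads, tosses) for a, b in priors.values())
    return abs(first - second)

sample_sizes = (0, 10, 100, 10_000)
gaps = [gap_after(n) for n in sample_sizes]
for n, gap in zip(sample_sizes, gaps):
    print(n, float(gap))
```

In this example the gap shrinks as the number of tosses grows but never reaches zero for any finite number of tosses: the two opinions become similar, not identical.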

It has been countered, I believe, that, if experience systematically 
leads people with opinions originally different to hold a common opinion, 
then that common opinion, and it only, is the proper subject of scien- 
tific probability theory. There are two inaccuracies in this argument. 
In the first place, the conclusion of the personalistic view is not that 
evidence brings holders of different opinions to the same opinions, but 
rather to similar opinions. In the second place, it is typically true of 
any observational program, however extensive but prescribed in ad- 
vance, that there exist pairs of opinions, neither of which can be called 
extreme in any precisely defined sense, but which cannot be expected, 
either by their holders or any other person, to be brought into close 
agreement after the observational program. 

I have, at least once, heard it objected against the personalistic view 
of probability that, according to that view, two people might be of 
different opinions, according as one is pessimistic and the other opti- 
mistic. I am not sure what position I would take in abstract discussion 
of whether that alleged property of personalistic views would be ob- 
jectionable, but I think it is clear from the formal definition of qualita- 
tive probability that the particular personalistic view sponsored here 
does not leave room for optimism and pessimism, however these traits 
be interpreted, to play any role in the person’s judgment of probabilities. 


+ See (Fisher 1934), p. 287. 


CHAPTER 5 


Utility 


1 Introduction 


The postulates P4-6, introduced in Chapter 3, have already led to 
simplification of the relation ≤ in so far as it applies to acts of a special 
but important form. Indeed, through the introduction of numerical 
probability, those special comparisons have been reduced to ordinary 
arithmetic comparison of numbers in such a way that many relations 
among acts are deducible by simple and systematic arithmetic calcula- 
tion. In this chapter it will be shown that the arithmetization of com- 
parison among acts can, with the introduction of one mild new postu- 
late, be extended to virtually all pairs of acts. 

This far-reaching arithmetization of comparison among acts is 
achieved by attaching a number $U(f)$ to each consequence $f$ in such a 
way that $\mathbf{f} \le \mathbf{g}$ if and only if the expected value of $U(\mathbf{f})$ is numerically 
less than or equal to that of $U(\mathbf{g})$, provided only that the real-valued 
functions $U(\mathbf{f})$ and $U(\mathbf{g})$ are essentially bounded. The provision can 
fail to be met only if there exist acts that are, so to speak, distinctly 
preferable to any fixed reward or distinctly worse than any fixed punish- 
ment. 
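
As a rough illustration of the arithmetization just described, the following
sketch compares two acts that take finitely many consequences by comparing
their expected utilities. The consequences, their utility values, and the
function names are hypothetical, chosen only to make the comparison concrete.

```python
from fractions import Fraction

def expected_utility(gamble, utility):
    """Return the sum of p_i * U(f_i) over (probability, consequence) pairs."""
    if sum(p for p, _ in gamble) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * utility[c] for p, c in gamble)

def not_preferred(gamble_f, gamble_g, utility):
    """True when f <= g, i.e. E[U(f)] does not exceed E[U(g)]."""
    return expected_utility(gamble_f, utility) <= expected_utility(gamble_g, utility)

# Hypothetical consequences and utility values (assumptions for illustration).
utility = {"nothing": Fraction(0), "small prize": Fraction(6, 10), "large prize": Fraction(1)}
sure_thing = [(Fraction(1), "small prize")]
long_shot = [(Fraction(1, 2), "nothing"), (Fraction(1, 2), "large prize")]

print(not_preferred(long_shot, sure_thing, utility))
```

With these assumed numbers the long shot has expected utility 1/2, so it is
not preferred to the sure small prize.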

A function U that thus arithmetizes the relation of preference among 
acts will be called a utility. It will be shown that the multiplicity of 
utilities is not complicated, every utility being simply related to every 
other. I have chosen to use the name “utility” in preference to any 
other, in spite of some unfortunate connotations this name has in con- 
nection with economic theory, because it was adopted by von Neumann 
and Morgenstern when in [V4] they revived the concept to which it re- 
fers, in a most stimulating way. Their treatment has been of such wide- 
spread interest that the introduction of a name other than “utility” at 
the present time would cause more confusion than it could alleviate. 

The next three sections are concerned with the technical exploration 
of the utility concept. I think readers interested in the details will find 
it best to read these sections twice as a unit, in the fashion I have been 
recommending for other material in which definitions and propositions 
are interlarded with proofs; others will be content with a cursory reading, 
omitting proofs. 

Taking advantage of the simplicity afforded by the introduction of 
utility, I try in § 5 to make some progress with the problem, pointed 
out in § 2.5, of specifying criteria for the construction of “small worlds.” 

Finally, § 6 briefly reports the history of the utility idea. A separate 
critical section is not necessary, because the criticisms of the theory of 
utility known to me are incorporated conveniently into the historical 
section. 


2 Gambles 


Before discussing utility, it is expedient to establish certain facts, 
the first being that at least among a rather rich class of acts, namely 
acts confined with probability one to a finite number of consequences, 
preference depends only on the probability distribution of the conse- 
quences of the acts. 


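THEOREM 1 
HYP. 1. $f_1, \cdots, f_n$ are $n$ elements of $F$, $n \ge 1$. 
2. $\rho_1, \cdots, \rho_n$ are numbers such that $\sum \rho_i = 1$. 
3. $\mathbf{g}$ and $\mathbf{h}$ are acts such that 
$P(g(s) = f_i) = P(h(s) = f_i) = \rho_i$; $i = 1, \cdots, n$. 
CONCL. $\mathbf{g} \doteq \mathbf{h}$. 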


PROOF. The theorem is obvious for $n = 1$. It will be proved by induction, 
supposing henceforth that $n > 1$. 

Let $B$ denote the intersection of the two events that $g(s) = f_n$ and 
$h(s) \ne f_n$, and let $C$ denote the intersection of the two events that 
$h(s) = f_n$ and $g(s) \ne f_n$. It is easy to see that $P(B) = P(C)$. $C$ can 
be partitioned into $C_0, C_1, \cdots, C_{n-1}$, where $C_0$ is a null event and $C_j$, 
$j = 1, \cdots, n - 1$, is the intersection of $C$ with the event that $g(s) = f_j$. 
By repeated application of Conclusion 7 of Theorem 3.3.3, $B$ can be 
partitioned into events $B_0, B_1, \cdots, B_{n-1}$ such that $P(B_j) = P(C_j)$, 
$j = 0, \cdots, n - 1$. 

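Let $\mathbf{g}_0 = \mathbf{g}$, and define $\mathbf{g}_{i+1}$ step by step for $i = 0, \cdots, n - 2$ thus: 

$$g_{i+1}(s) = \begin{cases} f_n & \text{for } s \in C_{i+1}, \\ f_{i+1} & \text{for } s \in B_{i+1}, \\ g_i(s) & \text{elsewhere.} \end{cases} \tag{1}$$ 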


It is easily seen from the facts of conditional probability that $\mathbf{g}_{i+1} \doteq \mathbf{g}_i$ 
given $B_{i+1} \cup C_{i+1}$, and it is even more obvious that $\mathbf{g}_{i+1} \doteq \mathbf{g}_i$ given 
$\sim(B_{i+1} \cup C_{i+1})$. Therefore $\mathbf{g}_{i+1} \doteq \mathbf{g}_i$, so $\mathbf{g}_{n-1} \doteq \mathbf{g}$. Furthermore, 
$P(g_{i+1}(s) = f_j) = P(g_i(s) = f_j) = \rho_j$, so $P(g_{n-1}(s) = f_j) = \rho_j$, $j = 1, 
\cdots, n$. Thus $\mathbf{g}_{n-1}$ is not only equivalent to $\mathbf{g}$ but also satisfies the 
hypothesis of the theorem relative to $\mathbf{h}$, so it will suffice to prove the 
theorem for $\mathbf{g}_{n-1}$ and $\mathbf{h}$ in place of $\mathbf{g}$ and $\mathbf{h}$. 

Now $\mathbf{g}_{n-1}$ has been constructed to equal $f_n$ in $C$, except on a null set. 
Therefore $\mathbf{g}_{n-1} \doteq \mathbf{h}$ given $C \cup D$, where $D$ is the subset of $\sim C$ on 
which $g_{n-1} = h = f_n$. 

It remains only to show that $\mathbf{g}_{n-1} \doteq \mathbf{h}$ given $\sim(C \cup D)$. If $\sim(C \cup D)$ 
is null, that is true automatically; henceforth concentrate on the less 
trivial situation. If $\sim(C \cup D)$ is not null, then $\le$ given $\sim(C \cup D)$ 
satisfies all the postulates assumed thus far, and therefore the 
consequences $f_1, \cdots, f_{n-1}$; the numbers $\rho_i' = \rho_i/(1 - \rho_n)$, $i = 1, \cdots, n - 1$; 
the acts $\mathbf{g}_{n-1}$ and $\mathbf{h}$; and the relation $\le$ given $\sim(C \cup D)$ satisfy the 
hypothesis of the theorem for a case in which it is supposed already to 
have been proved. ◆ 


In this chapter the notation $\sum \rho_i f_i$ will denote the class of all acts $\mathbf{f}$ 
for which there exist partitions $B_i$ of $S$ such that $P(B_i) = \rho_i$ and $f(s) = 
f_i$ for $s \in B_i$. Here the $f_i$'s are a finite sequence of consequences (not 
necessarily distinct), and the $\rho_i$'s a corresponding sequence of 
non-negative real numbers such that $\sum \rho_i = 1$. In view of Conclusion 7 of 
Theorem 3.3.3, such a class of acts, which will in this chapter be 
referred to as a gamble and denoted by $\mathsf{f}$, $\mathsf{g}$, $\mathsf{h}$, or the like, always has at 
least one element. Theorem 1 says, in effect, that the person regards 
all elements of any gamble as equivalent. To put it differently, if the 
events $B_i$ of a partition have the probabilities $\rho_i$, and if the act $\mathbf{f}$ is 
such that the consequence $f_i$ will befall the person in case $B_i$ occurs, 
then the value of $\mathbf{f}$ is independent of how the partition $B_i$ is chosen. 

Gambles can be mixed, in a sense, to make new gambles, thus: Let 
$\mathsf{f}_i$ be a finite sequence of gambles, 

$$\mathsf{f}_i = \sum_j \rho_{ij} f_{ij}, \tag{2}$$ 

and $\sigma_i$ a corresponding sequence of non-negative real numbers such 
that $\sum \sigma_i = 1$. The mixture of the $\mathsf{f}_i$'s with weights $\sigma_i$, denoted $\sum \sigma_i \mathsf{f}_i$, is 
defined by 

$$\sum_i \sigma_i \mathsf{f}_i = \sum_i \sigma_i \Bigl\{ \sum_j \rho_{ij} f_{ij} \Bigr\} = \sum_{i,j} (\sigma_i \rho_{ij}) f_{ij}, \tag{3}$$ 

which is meaningful, the $f_{ij}$'s being consequences and the $(\sigma_i \rho_{ij})$'s being 
numbers such that $\sum (\sigma_i \rho_{ij}) = 1$. Such mixtures are exemplified by an 
insurance policy in which the benefit is an annuity payable during the 
life of the beneficiary, and by a lottery in which the prizes are tickets 
in other lotteries. 
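
The mixture operation of (3) can be sketched in a few lines. The
representation below, a dictionary from consequence to probability, and the
two lottery tickets are assumptions made for illustration; merging repeated
consequences into a single entry is harmless in view of Theorem 1.

```python
from collections import defaultdict
from fractions import Fraction

def mix(weights, gambles):
    """Flatten sum_i sigma_i * f_i into weights sigma_i * rho_ij on consequences."""
    if sum(weights) != 1:
        raise ValueError("mixing weights must sum to 1")
    flattened = defaultdict(Fraction)
    for sigma, gamble in zip(weights, gambles):
        for consequence, rho in gamble.items():
            flattened[consequence] += sigma * rho
    return dict(flattened)

# A lottery whose prizes are tickets in two other (hypothetical) lotteries.
ticket_a = {"win": Fraction(1, 4), "lose": Fraction(3, 4)}
ticket_b = {"win": Fraction(1, 2), "lose": Fraction(1, 2)}
lottery_of_tickets = mix([Fraction(1, 3), Fraction(2, 3)], [ticket_a, ticket_b])

print(lottery_of_tickets)
```

The flattened weights again sum to one, which is the sense in which the
mixture is itself a gamble.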

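In view of Theorem 1, it is natural to say that $\mathsf{f} \le \mathbf{g}$ means that, for 
every act $\mathbf{f}$ in the class of acts corresponding to $\mathsf{f}$, $\mathbf{f} \le \mathbf{g}$. Corresponding 
definitions are to be understood for $\mathbf{f} \le \mathsf{g}$, $\mathsf{f} \le \mathsf{g}$, $\mathsf{f} < \mathsf{g}$, etc. 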


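THEOREM 2 If $\mathsf{f}$, $\mathsf{g}$, and $\mathsf{h}$ are gambles, and $0 < \rho \le 1$; then $\rho\mathsf{f} + 
(1 - \rho)\mathsf{h} \le \rho\mathsf{g} + (1 - \rho)\mathsf{h}$, if and only if $\mathsf{f} \le \mathsf{g}$. 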


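PROOF. Let $\mathbf{f}$, $\mathbf{g}$; $f_i$, $g_j$; and $B_i$, $C_j$ be acts, consequences, and 
partitions such that $\mathbf{f}$ and $\mathbf{g}$ are among the acts represented by $\mathsf{f}$ and $\mathsf{g}$, 
respectively, with $f(s) = f_i$ for $s \in B_i$ and $g(s) = g_j$ for $s \in C_j$. 

Construct $D_{ij} \subset B_i \cap C_j$ such that $P(D_{ij}) = \rho P(B_i \cap C_j)$, and let 
$D = \bigcup D_{ij}$. Then $P(D) = \rho$, $P(B_i \mid D) = P(B_i)$, and $P(C_j \mid D) = 
P(C_j)$. 

What is to be proved is, in effect, that $\mathbf{f} \le \mathbf{g}$ given $D$, if and only if 
$\mathbf{f} \le \mathbf{g}$. In view of Theorem 1 it is clear that whether that is so or not 
for $\mathbf{f}$ and $\mathbf{g}$ does not depend on the particular choice of $D$; so, with an 
obvious temporary extension of terminology, it is to be proved that $\mathsf{f} \le \mathsf{g}$ 
given $\rho$, if and only if $\mathsf{f} \le \mathsf{g}$. 

If $\mathsf{f} \doteq \mathsf{g}$ given $\alpha$ for every $0 < \alpha \le 1$, there is nothing to prove. 
Otherwise it can be assumed without loss of generality that, for some 
$\alpha_0$, $\mathsf{f} < \mathsf{g}$ given $\alpha_0$. 

In view of Theorem 2.7.2, if $\alpha + \beta \le 1$, $\mathsf{f} \ge \mathsf{g}$ given $\alpha$, and $\mathsf{f} \ge \mathsf{g}$ 
given $\beta$; then $\mathsf{f} \ge \mathsf{g}$ given $(\alpha + \beta)$, and similarly $\mathsf{f} \ge \mathsf{g}$ given $\alpha/2$. 

Making use of P6 and Theorem 2.7.2, it can easily be shown that, for 
any $\alpha$ sufficiently close to $\alpha_0$, $\mathsf{f} < \mathsf{g}$ given $\alpha$. 

The preceding three paragraphs imply that, in the case at hand, 
$\mathsf{f} < \mathsf{g}$ given $\alpha$ for every $\alpha$, $0 < \alpha \le 1$. ◆ 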


THEOREM 3 If $\mathsf{f} < \mathsf{g}$, and $0 \le \sigma < \rho \le 1$, then $\rho\mathsf{f} + (1 - \rho)\mathsf{g} < 
\sigma\mathsf{f} + (1 - \sigma)\mathsf{g}$. 


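PROOF. In view of the immediately verifiable identities, 

$$\begin{aligned}
\rho\mathsf{f} + (1 - \rho)\mathsf{g} &= (\rho - \sigma)\mathsf{f} + [1 - (\rho - \sigma)] \left\{ \frac{\sigma}{1 - (\rho - \sigma)}\,\mathsf{f} + \frac{(1 - \rho)}{1 - (\rho - \sigma)}\,\mathsf{g} \right\}, \\
\sigma\mathsf{f} + (1 - \sigma)\mathsf{g} &= (\rho - \sigma)\mathsf{g} + [1 - (\rho - \sigma)] \left\{ \frac{\sigma}{1 - (\rho - \sigma)}\,\mathsf{f} + \frac{(1 - \rho)}{1 - (\rho - \sigma)}\,\mathsf{g} \right\},
\end{aligned} \tag{4}$$ 

this theorem is a special case of Theorem 2; unless $\rho = 1$, and $\sigma = 0$, 
in which case it is trivial. ◆ 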


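THEOREM 4 If $\mathsf{f}_1 < \mathsf{f}_2$ and $\mathsf{f}_1 \le \mathsf{g} \le \mathsf{f}_2$, then there is one and only 
one $\rho$ such that $\rho\mathsf{f}_1 + (1 - \rho)\mathsf{f}_2 \doteq \mathsf{g}$. 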


PROOF. It follows immediately from Theorem 3 and the principle of 
the Dedekind cut† that there is one and only one $\rho_0$ such that 

$$\begin{aligned}
\sigma\mathsf{f}_1 + (1 - \sigma)\mathsf{f}_2 &< \mathsf{g}, \quad \text{if } \sigma > \rho_0, \\
\sigma\mathsf{f}_1 + (1 - \sigma)\mathsf{f}_2 &> \mathsf{g}, \quad \text{if } \sigma < \rho_0.
\end{aligned} \tag{5}$$ 

According to (5), no number, except possibly $\rho_0$, can satisfy the 
equivalence demanded by the theorem. 

Finally, using (5) and P6 (much as it was used in the proof of 
Theorem 2), it follows that $\rho_0$ does indeed satisfy the equivalence. ◆ 
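
The Dedekind-cut argument of Theorem 4 suggests a numerical search, sketched
below under stated assumptions. The preference oracle is hypothetical: here
it is simulated with made-up numbers standing in for the person's judgments,
whereas the theorem itself presupposes no such numbers.

```python
def indifference_weight(mixture_is_worse_than_g, tolerance=1e-9):
    """Approximate the rho_0 of (5) by bisection on the weight sigma.

    mixture_is_worse_than_g(sigma) must return True exactly when
    sigma*f1 + (1 - sigma)*f2 < g, which by (5) happens for sigma > rho_0.
    """
    low, high = 0.0, 1.0
    while high - low > tolerance:
        middle = (low + high) / 2
        if mixture_is_worse_than_g(middle):
            high = middle   # too much weight on the worse gamble f1
        else:
            low = middle    # mixture still at least as good as g
    return (low + high) / 2

# Hypothetical numbers simulating the person's preferences (assumption).
VALUE_F1, VALUE_F2, VALUE_G = 0.0, 10.0, 7.0
oracle = lambda sigma: sigma * VALUE_F1 + (1 - sigma) * VALUE_F2 < VALUE_G
rho_0 = indifference_weight(oracle)

print(round(rho_0, 6))
```

Monotonicity of the oracle in the weight is exactly what Theorem 3 supplies;
without it, bisection would not be justified.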


3 Utility, and preference among gambles 


The idea of utility can most conveniently be introduced in connec- 
tion with gambles or, equivalently, acts that with probability one are 
confined to a finite number of consequences, thus: A utility is a function 
$U$ associating real numbers with consequences in such a way that, if 
$\mathsf{f} = \sum \rho_i f_i$ and $\mathsf{g} = \sum \sigma_j g_j$; then $\mathsf{f} \le \mathsf{g}$, if and only if $\sum \rho_i U(f_i) \le \sum \sigma_j U(g_j)$. 
Writing $U[\mathsf{f}]$ for $\sum \rho_i U(f_i)$, the condition takes the form $U[\mathsf{f}] \le U[\mathsf{g}]$. 
Similarly, it is convenient to understand that, for an act $\mathbf{f}$, 

$$U[\mathbf{f}] = E(U(\mathbf{f})). \tag{1}$$ 


In this notation the following obvious theorem gives a slightly different 
characterization of utility. 


THEOREM 1 A real-valued function of consequences, $U$, is a utility; 
if and only if $\mathbf{f} \le \mathbf{g}$ is equivalent to $U[\mathbf{f}] \le U[\mathbf{g}]$, provided $\mathbf{f}$ and $\mathbf{g}$ are 
both with probability one confined to a finite set of consequences. 


Do the postulates thus far assumed guarantee that any utilities exist 
at all? Can Theorem 1 be extended to an even wider class of acts? 
Does a great diversity of utilities exist, or does the relation $\le$ practically 
determine the function $U$? These questions, here mentioned in 
the order in which they most naturally arise, are manifestly of great 
importance in understanding utility. For technical reasons, they will 
be answered in a different order: the third followed by the first in this 
section, and the second in the next section. 

† Cf., if necessary, any introduction to the theory of the real numbers for 
explanation of this principle, e.g., Chapter II of [G3]. 

If there is a utility at all, there is surely more than one, because a 
utility plus a constant and a utility times a positive constant are also 
obviously utilities; thus: 


THEOREM 2 If $U$ is a utility, and $\rho$, $\sigma$ are real numbers with $\rho > 0$; 
then $U' = \rho U + \sigma$ is also a utility. 


COROLLARY 1 If there exists a utility, and if $f < g$; then there exists 
a utility $U$ for which $U(f)$ and $U(g)$ are any preassigned pair of 
numbers, provided $U(f) < U(g)$. 


Theorem 2 says that any increasing linear function of a utility is a 
utility. The next theorem says that, conversely, any two utilities are 
necessarily increasing linear functions of one another. 


THEOREM 3 If $U$ and $U'$ are utilities, there exist numbers $\rho$ and $\sigma$ 
such that $U' = \rho U + \sigma$, $\rho > 0$. 


PROOF. The first step of the proof will be to demonstrate the following 
identity for the two utilities $U$ and $U'$ and for any three consequences 
$f$, $g$, $h$. 

$$\begin{vmatrix} 1 & 1 & 1 \\ U(f) & U(g) & U(h) \\ U'(f) & U'(g) & U'(h) \end{vmatrix} = 0. \tag{2}$$ 


If any two of the consequences f, g, h are equivalent, two columns of 
the determinant in question are equal, and therefore the determinant 
vanishes. It can be assumed, then, that no two of f, g, and h are equiv- 
alent; and there is no loss in generality, as may be seen by permuting 
columns, in assuming $f < g < h$. Theorem 2.4 now permits the conclusion 
that there is a $\rho$ such that $\rho f + (1 - \rho)h \doteq g$. Therefore, 

$$\begin{aligned}
1 &= \rho 1 + (1 - \rho)1 \\
U(g) &= \rho U(f) + (1 - \rho)U(h) \\
U'(g) &= \rho U'(f) + (1 - \rho)U'(h).
\end{aligned} \tag{3}$$ 


Thus the middle column of the determinant is linearly dependent on 
the other two, so the determinant vanishes, as was asserted. 

Now let $g$ and $h$ be any fixed pair of consequences such that $g < h$, 
the existence of such a pair being assured by P5. Equation (2) can be 
successively rewritten, where $f$ is an arbitrary consequence, thus: 

$$1[U(g)U'(h) - U(h)U'(g)] - U(f)[U'(h) - U'(g)] + U'(f)[U(h) - U(g)] = 0, \tag{4}$$ 

$$U'(f) = \frac{U'(h) - U'(g)}{U(h) - U(g)}\,U(f) - \frac{U(g)U'(h) - U(h)U'(g)}{U(h) - U(g)}, \tag{5}$$ 

which proves the theorem; for $U'(h) - U'(g)$ and $U(h) - U(g)$ are 
both positive. ◆ 


COROLLARY 2 If $U$ and $U'$ are utilities such that, for some $g < h$, 
$U(g) = U'(g)$ and $U(h) = U'(h)$; then $U$ and $U'$ are the same, that is, 
for every $f$, $U(f) = U'(f)$. 


To summarize, if there is a utility at all, there are an infinite number, 
but the array of utilities is not complicated; for all can be generated 
from any one by increasing linear transformations. 
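
The summary above can be checked numerically in a small case. In this
sketch the utility values and the helper names `rescale` and `expected` are
hypothetical; the point is only that an increasing linear transformation,
here the one pinning two chosen consequences to 0 and 1 as in Corollary 1,
leaves every comparison of expected utilities unchanged.

```python
def rescale(utility, g, h, value_g=0.0, value_h=1.0):
    """Return rho*U + sigma, rho > 0, with U'(g) = value_g and U'(h) = value_h."""
    if not (utility[g] < utility[h] and value_g < value_h):
        raise ValueError("need U(g) < U(h) and value_g < value_h")
    rho = (value_h - value_g) / (utility[h] - utility[g])
    sigma = value_g - rho * utility[g]
    return {c: rho * u + sigma for c, u in utility.items()}

def expected(gamble, utility):
    """Expected utility of a gamble given as {consequence: probability}."""
    return sum(p * utility[c] for c, p in gamble.items())

utility = {"a": 2.0, "b": 5.0, "c": 11.0}   # hypothetical utility values
normalized = rescale(utility, "a", "c")      # pins U'(a) = 0 and U'(c) = 1

even_chance = {"a": 0.5, "c": 0.5}
sure_b = {"b": 1.0}
print(expected(even_chance, utility) <= expected(sure_b, utility),
      expected(even_chance, normalized) <= expected(sure_b, normalized))
```

Both comparisons print the same truth value, as Theorem 2 requires.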


Turn now to the question of existence. 

THEOREM 4 There exists a utility. 


PROOF. Von Neumann and Morgenstern prove essentially this theorem, 
as well as the preceding one, in the appendix of [V4]. The following 
proof is theirs, expressed, as the teacher used to say, in my own words. 

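For this proof only, certain special nomenclature is introduced. A 
set of gambles $\mathsf{F}$ is convex; if and only if, for every $\mathsf{f}, \mathsf{g} \in \mathsf{F}$ and $\rho$, $0 \le \rho 
\le 1$, $\rho\mathsf{f} + (1 - \rho)\mathsf{g} \in \mathsf{F}$. An interval $I$ of gambles is the set of all 
gambles $\mathsf{f}$ such that, for some fixed $\mathsf{g}$ and $\mathsf{h}$ (which determine the interval), 
$\mathsf{g} \le \mathsf{f} \le \mathsf{h}$. A hyper-utility $V$ on a convex set $\mathsf{F}$ is a real-valued 
function of the gambles of $\mathsf{F}$, such that $\mathsf{f} \le \mathsf{g}$, if and only if $V(\mathsf{f}) \le V(\mathsf{g})$, 
and such that $V(\rho\mathsf{f} + (1 - \rho)\mathsf{g}) = \rho V(\mathsf{f}) + (1 - \rho)V(\mathsf{g})$. 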

The following remarks about this special nomenclature are obvious 
and will be repeatedly used in the proof, without explicit reference. 
The set of all gambles is convex. The intersection of two convex sets 
is convex. Every interval is convex. There is an interval containing 
any finite set of gambles. If there is a hyper-utility on the set of all 
gambles, it is a utility when confined to consequences. 

By the same method that led to the proofs of Theorems 2 and 3, 
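if there is a hyper-utility on $\mathsf{F}$ containing $\mathsf{g}$ and $\mathsf{h}$, with $\mathsf{g} < \mathsf{h}$, then there 
is one and only one hyper-utility $V$ on $\mathsf{F}$ such that $V(\mathsf{g}) = 0$ and $V(\mathsf{h}) 
= 1$. 

If $I$ is the interval determined by $\mathsf{g} < \mathsf{h}$, then, according to Theorem 
2.4, there is for every $\mathsf{f}$ in $I$ a unique number, call it $V(\mathsf{f})$, such that 

$$\mathsf{f} \doteq (1 - V(\mathsf{f}))\mathsf{g} + V(\mathsf{f})\mathsf{h}. \tag{6}$$ 

By repeated use of Theorem 2.2, it follows for any $\mathsf{f}, \mathsf{f}' \in I$ that 

$$\begin{aligned}
\rho\mathsf{f} + (1 - \rho)\mathsf{f}' &\doteq \rho\{(1 - V(\mathsf{f}))\mathsf{g} + V(\mathsf{f})\mathsf{h}\} + (1 - \rho)\{(1 - V(\mathsf{f}'))\mathsf{g} + V(\mathsf{f}')\mathsf{h}\} \\
&= \{1 - [\rho V(\mathsf{f}) + (1 - \rho)V(\mathsf{f}')]\}\mathsf{g} + [\rho V(\mathsf{f}) + (1 - \rho)V(\mathsf{f}')]\mathsf{h},
\end{aligned} \tag{7}$$ 

so $V$ is a hyper-utility on the convex set $I$. 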

From here on in this proof, let g, h be a fixed pair of consequences with 
$g < h$. Making use of the preceding two paragraphs, there is a unique 
hyper-utility assigning the values 0 and 1 to g and h, respectively, on 
any one interval containing g and h. The intersection of two such in- 
tervals is a convex set containing g and h, and on the intersection the 
hyper-utilities associated with the two intervals are both hyper-utilities 
attaching 0 and 1 to g and h, respectively; they must, therefore, be 
equal to one another on the intersection. 

Any gamble f is an element of some interval containing g and h. 
Let V(f) be the common value assigned to f by all the hyper-utilities 
that are defined on intervals containing f, g, and h and that assign the 
values 0 and 1 to g and h, respectively. Since there is always at least 
one such interval for any gamble f, the function V is defined for all 
gambles. 

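The proof will be complete when it is shown that $V$ is a hyper-utility 
for the convex set of all gambles. Let $\mathsf{f}$ and $\mathsf{f}'$ be any two gambles and 
$\rho$ a number, $0 \le \rho \le 1$. There is an interval containing $\mathsf{f}$, $\mathsf{f}'$, $g$, $h$, and 
$\rho\mathsf{f} + (1 - \rho)\mathsf{f}'$. In that interval the function $V$ is a hyper-utility. 
Therefore $V(\rho\mathsf{f} + (1 - \rho)\mathsf{f}') = \rho V(\mathsf{f}) + (1 - \rho)V(\mathsf{f}')$ and $V(\mathsf{f}) \le V(\mathsf{f}')$, 
if and only if $\mathsf{f} \le \mathsf{f}'$. ◆ 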


4 The extension of utility to more general acts 


The requirement that an act have only a finite number of conse- 
quences may seem, from a practical point of view, almost no require- 
ment at all. To illustrate, the number of time intervals that might 
possibly be the duration of a human life can be regarded as finite, if 
you agree that the duration may as well be rounded to the nearest 
minute, or second, or microsecond, and that there is almost no possi- 
bility of its exceeding a thousand years. More generally, it is plausible 
that, no matter what set of consequences is envisaged, each consequence 
can be practically identified with some element of a suitably 
chosen finite, though possibly enormous, subset. It might therefore 
seem of little or no importance to extend the concept of utility to acts 
having an infinite number of consequences. If that argument were 
valid, it could easily be extended to reach the conclusion that infinite 
sets are irrelevant to all practical affairs, and therefore to all parts of 
applied mathematics. But it is one of the most profound lessons of 
mathematical experience that infinite sets, tactfully handled, can lead 
to great simplification of situations that could, in principle, but only 
with enormous difficulty, be treated in terms of finite sets. How difficult 
it would be to study geometry if one made at the outset the “simplifying 
assumption” that to all intents and purposes at most $10^{1,000}$ 
points in space can be discriminated from one another! Again, it is 
generally more convenient and fruitful to think of the annual cash in- 
come of an individual or firm as a continuous variable with an infinite 
number of possible values than as a discrete variable confined to some 
large finite number of values, even if it is known that the income must 
be some integral number of cents less, say, than $10^{10}$. 

One way to extend the concept of utility to acts with an infinite 
number of consequences would be to postulate: If $U[\mathbf{f}]$ and $U[\mathbf{g}]$ both 
exist (the values $+\infty$ and $-\infty$ being regarded as possible); $\mathbf{f} \le \mathbf{g}$, if 
and only if $U[\mathbf{f}] \le U[\mathbf{g}]$. I see no serious objection to making this 
assumption outright, though it might be complained that the assumption 
is motivated more by general mathematical intuition and experience 
than by intuitive standards of consistency among decisions, which I 
have tried to take as my sole guide thus far. A statement almost as 
strong as the one in question can, however, be derived on adjoining a 
new postulate, P7, more in the spirit of P1-6. That rather technical 
program will be carried out in the next several paragraphs. Those not 
interested can safely skip to the paragraph following Corollary 1 on 
page 80. 

Suppose that every possible consequence of the act g is at least as 
attractive to the person as the act f considered as a whole; then it seems 
to me within the spirit of the sure-thing principle to conclude that 
$\mathbf{f} \le \mathbf{g}$; the same might as fairly have been said for the relation $\ge$, and 
also for the two relations $\le$ given $B$ and $\ge$ given $B$. This idea is 
formalized in the following postulate, which, according to the conventions 
of mathematical double-talk, is to be interpreted as two propositions, 
one having $\le$ and the other $\ge$ throughout. 


P7 If $\mathbf{f} \le (\ge)\ g(s)$ given $B$ for every $s \in B$, then $\mathbf{f} \le (\ge)\ \mathbf{g}$ given $B$. 

Attention has been called to the mathematically useful fact that, if 
P1-6 apply to a relation $\le$, then they also apply to any relation $\le$ 
given $B$, provided $B$ is not null. It is obvious that the same is true for 
P1-7, a fact that will be used often. It is also noteworthy that P1-7 
obviously imply the propositions that arise if in them every instance 
of the sign $\le$ is replaced by $\ge$ and every instance of $\ge$ is replaced by 
$\le$. Therefore in any deduction from P1-7 every instance of the signs 
$\le$ and $\ge$ can be reversed to produce a deduction that may be called 
the symmetric dual of the original deduction. This remark, a legitimate 
child of the principle of insufficient reason, has not been important 
heretofore, because almost all deductions thus far made have been their 
own symmetric duals. Since that will not be so of some of the lemmas 
in the present section, much needless writing and thinking can be saved 
by agreeing at the outset that, once a result is proved, it and its sym- 
metric dual may be used as if both had been explicitly proved. 

Before going to work with P7, some may wish to see an example of 
a mathematical structure satisfying P1-6 but not satisfying P7. More- 
over, understanding of such an example will do much to clarify the uses 
to be made of P7. To construct the example, begin by letting S be a 
set carrying a finitely additive probability measure P under which S 
can be partitioned into subsets of arbitrarily small probability. Let 
the set of consequences be the half-open interval of numbers $0 \le f < 1$. 
Let $U(f) = f$, $U[\mathbf{f}] = E(\mathbf{f})$, and 

$$V[\mathbf{f}] = \lim_{\epsilon \to 0} P\{f(s) \ge 1 - \epsilon\}. \tag{1}$$ 

Since the probability in (1) decreases with $\epsilon$, there is no question about 
the existence of the limit. Now let $W[\mathbf{f}] = U[\mathbf{f}] + V[\mathbf{f}]$, and define 
$\mathbf{f} \le \mathbf{g}$ to mean that $W[\mathbf{f}] \le W[\mathbf{g}]$. Checking postulates P1-6, it will 
be found that the $\le$ thus defined satisfies them all, and that what has 
here been called $U(f)$ is indeed a utility for $\le$. But if, for example, 
there is an $\mathbf{f}$ such that $U[\mathbf{f}] = V[\mathbf{f}] = \tfrac{1}{2}$, P7 is violated, as can be seen 
by comparing $\mathbf{f}$ to the act that, for each $s$, takes as value the maximum 
of $\tfrac{1}{2}$ and $f(s)$. Whether there can be such an $\mathbf{f}$, may, so far as I know, 
depend on the choice of S and P. But, if the positive integers are taken 
as S, and P is so chosen that though the probability of any one integer 
is 0 the probability of the set of even integers is 1/2, a possibility 
assured by the note to Section 3 of Chapter II on p. 231 of [B4], the 
function equal to 0 at the odd integers and equal to $(1 - 1/n)$ at each even 
$n$ is such an $\mathbf{f}$. Finite, as opposed to countable, additivity seems to be 
essential to this example; perhaps, if the theory were worked out in a 
countably additive spirit from the start, little or no counterpart of P7 
would be necessary. 
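
The violation of P7 in this example can be spelled out as a short worked
computation. It is offered only as a sketch: it takes on trust the values
$U[\mathbf{f}] = V[\mathbf{f}] = \tfrac12$ asserted above for the act equal to 0 at the odd
integers and to $1 - 1/n$ at each even $n$, and writes $\mathbf{g}$ (a label introduced
here) for the act with $g(s) = \max(\tfrac12, f(s))$.

```latex
% Assumes P(even) = P(odd) = 1/2 and the values of U[f], V[f] asserted in the text.
\begin{align*}
W[\mathbf{f}] &= U[\mathbf{f}] + V[\mathbf{f}] = \tfrac12 + \tfrac12 = 1, \\
W[c] &= c + 0 = c < 1 \quad \text{for every constant act } c,\ 0 \le c < 1, \\
U[\mathbf{g}] &= \tfrac12\,P(\text{odd}) + U[\mathbf{f}] = \tfrac14 + \tfrac12 = \tfrac34, \qquad
V[\mathbf{g}] = \tfrac12, \qquad W[\mathbf{g}] = \tfrac54 .
\end{align*}
```

Each value $g(s)$ is a constant below 1, so $\mathbf{f} > g(s)$ for every $s$, and P7 in
its $\ge$ form would demand $\mathbf{f} \ge \mathbf{g}$; yet $W[\mathbf{f}] = 1 < \tfrac54 = W[\mathbf{g}]$, that is,
$\mathbf{f} < \mathbf{g}$.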


+ Fishburn (1970, Exercise 21, p. 213) has suggested an appropriate weak- 
ening of P7. 

Several lemmas depending on P7 are now to be proved preparatory 
to proving that U[f] governs preference for a very large class of acts. 
It is to be understood throughout the section that U is any fixed utility. 
The truth of each lemma is intuitively clear, in the sense that each could 
justifiably be accepted as a postulate if need be. Since they are also 
easy to prove and of secondary interest, condensed proofs will suffice. 


LEMMA 1 If, for every consequence $h$, $\mathbf{f} \le h$, and $\mathbf{g} \le h$; then $\mathbf{f} \doteq \mathbf{g}$. 


PROOF. Consider in the light of P7 that $\mathbf{f} \le g(s)$ and $\mathbf{g} \le f(s)$ for 
every $s$. ◆ 


LEMMA 2 If there exists a consequence $f_0$ such that $\mathbf{f} \le f_0$, and if 
$U(f(s)) \le U_0$ for every $s$, then there exists a gamble $\mathsf{g}$ such that $\mathbf{f} \le \mathsf{g}$ 
and $U[\mathsf{g}] \le U_0$. 


PROOF. If $U(f_0) \le U_0$, then $\mathsf{g}$ can be taken to consist of $f_0$ alone. 
Otherwise, let $f_1$ be any consequence such that $U(f_1) \le U_0$ and let $\mathsf{g}$ 
be the unique mixture of $f_0$ and $f_1$ such that $U[\mathsf{g}] = U_0$. ◆ 


LEMMA 3 


HYP. 1. The $B_i$'s, $i = 1, \cdots, n$, are a partition, and the $U_i$'s are 
corresponding numbers. 

2. $\mathbf{f}$ is an act such that $U(f(s)) \le U_i$ for $s \in B_i$. 

3. $\mathsf{f}$ is a gamble such that $\mathsf{f} \le \mathbf{f}$. 


CONCL. $U[\mathsf{f}] \le \sum U_i P(B_i)$. 


PROOF. If the lemma were false, it would be false even for some $\mathsf{f} < \mathbf{f}$. 
Then it may be assumed, modifying $\mathbf{f}$ if need be by means of P6 and 
Lemma 1, that there exists for each $i$ an $f_i$ such that $\mathbf{f} \le f_i$ given $B_i$. 
Now, in view of Lemma 2, there exists for each $i$ a $\mathsf{g}_i$ such that $\mathbf{f} \le \mathsf{g}_i$ 
given $B_i$ and $U[\mathsf{g}_i] \le U_i$. Let $\mathsf{g} = \sum P(B_i)\mathsf{g}_i$, and observe that $\mathsf{f} < 
\mathbf{f} \le \mathsf{g}$. Therefore, $U[\mathsf{f}] < U[\mathsf{g}] = \sum P(B_i)U[\mathsf{g}_i] \le \sum P(B_i)U_i$. ◆ 


An act will be called bounded if its utility is, according to ordinary 
mathematical usage, an essentially bounded random variable; the notion 
is put in a more formal and self-contained way as follows: A bounded 
act is an act $\mathbf{f}$ such that, for some two numbers $U_0$ and $U_1$, $P\{U_0 \le 
U(f(s)) \le U_1\} = 1$. The definition is clearly not dependent on the 
choice of U. 


THEOREM 1 If $\mathbf{f}$ and $\mathbf{g}$ are bounded, then $\mathbf{f} \le \mathbf{g}$, if and only if 
$U[\mathbf{f}] \le U[\mathbf{g}]$. 


PROOF. If there exist $g$ and $h$ such that $g \le \mathbf{f} \le h$, then there is, 
by Theorem 2.4, a mixture $\mathsf{f}$ of $g$ and $h$ such that $\mathsf{f} \doteq \mathbf{f}$. The null event 
on which $U(f(s))$ is not between $U_0$ and $U_1$ may as well be disregarded; 
the rest can be partitioned into $n + 1$ events $B_i$ defined by the condition 
that $s \in B_i$ if and only if $V_{i-1} \le U(f(s)) < V_i$, $i = 1, \cdots, n + 1$, 
where 

$$V_i = \left(1 - \frac{i}{n}\right)U_0 + \frac{i}{n}\,U_1, \qquad i = 0, \cdots, n + 1. \tag{2}$$ 

Applying Lemma 3 and its symmetric dual, 

$$\sum V_{i-1}P(B_i) \le U[\mathsf{f}] \le \sum V_i P(B_i). \tag{3}$$ 

Similarly, according to Exercise 3 of Appendix 1, 

$$\sum V_{i-1}P(B_i) \le U[\mathbf{f}] \le \sum V_i P(B_i). \tag{4}$$ 

Therefore 

$$|\,U[\mathsf{f}] - U[\mathbf{f}]\,| \le \sum (V_i - V_{i-1})P(B_i) = (U_1 - U_0)/n, \tag{5}$$ 

whence $U[\mathsf{f}] = U[\mathbf{f}]$. 

To consider the remaining case, suppose that the bounded act $\mathbf{f}$ exceeds 
(is exceeded by) every consequence; call it for the moment big 
(little). According to Lemma 1, all big (and, dually, all little) acts are 
equivalent to one another. Furthermore, it is, for example, easily seen 
that, if an act is big, then for $\epsilon > 0$, 

$$P\{U(f(s)) \ge \sup_f U(f) - \epsilon\} = 1. \tag{6}$$ 

(Some may be more familiar with the notation “LUB” and “GLB,” 
read “least upper bound” and “greatest lower bound,” than with the 
corresponding “sup” and “inf,” read “supremum” and “infimum.” If 
even these older terms are not familiar, see Exercise 4 of Appendix 2.) 
Therefore, if there are big (little) acts, they all have the same expected 
utility, namely $\sup U(f)$ ($\inf U(f)$). 

Suppose now that $\mathbf{f} \le \mathbf{g}$. It is possible that $\mathbf{f}$ and $\mathbf{g}$ are both little; 
that $\mathbf{f}$ is little, and $\mathbf{g}$ is equivalent to some gamble; that $\mathbf{f}$ is little and 
$\mathbf{g}$ big; that $\mathbf{f}$ and $\mathbf{g}$ are each equivalent to some gamble; that $\mathbf{f}$ is 
equivalent to some gamble, and $\mathbf{g}$ is big; or, finally, that they are both big. 
In each of these cases, a simple argument shows that $U[\mathbf{f}] \le U[\mathbf{g}]$. 
The converse arguments are similar. ◆ 
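
The bracketing step of the proof, inequalities (3) through (5), can be
illustrated numerically. The sketch below assumes a finite list of utility
values with attached probabilities (an assumption; the theorem concerns
general bounded acts) and computes the lower and upper sums over the grid
$V_0, \cdots, V_{n+1}$ of (2).

```python
def bracket_expected_utility(values, probabilities, u0, u1, n):
    """Lower and upper sums of (3)-(4) for utilities lying in [u0, u1]."""
    grid = [(1 - i / n) * u0 + (i / n) * u1 for i in range(n + 2)]  # V_0 .. V_{n+1}
    cell_probability = [0.0] * (n + 2)   # index i holds P(B_i); index 0 is unused
    for value, probability in zip(values, probabilities):
        if not u0 <= value <= u1:
            raise ValueError("utility outside [u0, u1]")
        # B_i collects the states with V_{i-1} <= U(f(s)) < V_i.
        i = next(k for k in range(1, n + 2) if grid[k - 1] <= value < grid[k])
        cell_probability[i] += probability
    lower = sum(grid[i - 1] * cell_probability[i] for i in range(1, n + 2))
    upper = sum(grid[i] * cell_probability[i] for i in range(1, n + 2))
    return lower, upper

# Hypothetical bounded act: four equally likely utility levels in [0, 1].
values = [0.0, 0.25, 0.5, 1.0]
probabilities = [0.25, 0.25, 0.25, 0.25]
lower, upper = bracket_expected_utility(values, probabilities, 0.0, 1.0, n=4)
exact = sum(v * p for v, p in zip(values, probabilities))

print(lower, exact, upper)
```

The gap between the two sums is $(U_1 - U_0)/n$ whatever the act, which is
why letting $n$ grow forces the two expected utilities in (5) to agree.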


COROLLARY 1 If $\mathbf{f}$ and $\mathbf{g}$ are bounded, and $P(B) > 0$, then $\mathbf{f} \le \mathbf{g}$ 
given $B$, if and only if $E(U(\mathbf{f}) - U(\mathbf{g}) \mid B) \le 0$. 


It would be possible to explore unbounded acts for which expected 
utility exists to see whether expected utility governs preferences among 
even such acts under postulates P1-7 or under some extension of them.+ 
I do not think, however, that the question is sufficiently interesting to 
warrant attention here, especially since there is some reason, first stated 
by Gabriel Cramer in a letter partially reproduced in [B10], to postulate 
that there are upper and lower bounds to utility, in which case all acts 
would necessarily be bounded. 

+ Peter Fishburn (1970, pp. 194, 206-207) and I have since discovered to 
my surprise that these postulates imply bounded utility, which puts the next 
several paragraphs in a new light. 

Even without P7, the postulates imply, in the following sense, that 
no gamble has infinite or minus infinite utility. 

An act $\mathbf{f}$ has infinite (minus infinite) utility; if and only if, for some 
$g < (>)\ h$ and for every $\epsilon > 0$, there is a $B$ with $P(B) < \epsilon$ and such 
that the act equal to $\mathbf{f}$ on $B$ and to $g$ on $\sim B$ exceeds (is exceeded by) $h$. 
A gamble or a consequence would be said to have infinite (minus 
infinite) utility, if one of the acts corresponding to it had infinite (minus 
infinite) utility. 

Indeed, Theorem 2.4, a deduction from P1-6, obviously implies that 
there are no infinite or minus infinite gambles or consequences. It 
may, however, be mentioned that Pascal held that, in just the sense 
at hand, salvation is an infinite consequence ([P2], pp. 189-191). Again, 
it is often said, in effect, that the utility to a person of immediate death 
is a consequence of minus infinite utility, but casual observation shows 
that this is not true of anyone—at least not of anyone who would cross 
the street to greet a friend. In the same vein, medicine often gives lip 
service to the idea that the death of a patient is of minus infinite utility, 
and, of course, doctors do go to great lengths to keep their patients 
alive; but a doctor who took the idea too seriously would make a nui- 
sance of himself and soon find himself with no patients to treasure. 

If the utility of consequences is unbounded, say from above,† then, 
even in the presence of P1-7, acts (though not gambles) of infinite 
utility can easily be constructed. My personal feeling is that, theo- 
logical questions aside, there are no acts of infinite or minus infinite 
utility, and that one might reasonably so postulate, which would amount 
to assuming utility to be bounded. 

Justifiable though it might be, that assumption would entail a certain 
mathematical awkwardness in many practical contexts. For example, 
as will be discussed at greater length in Chapter 15, it sometimes 
seems reasonable to suppose that the penalty for acting as though a 
particular unknown number were $\hat{\mu}$ instead of its true value, $\mu$, is 
proportional to $\delta^2 = (\mu - \hat{\mu})^2$. But, if the possible values of $\mu$ are unbounded, 
then so are the possible values of $\delta$, so utility is here taken to be 
unbounded. On close scrutiny of such an example one always finds that 
it is not really reasonable to assume the penalty even roughly 
proportional to $\delta^2$ for large values of $\delta^2$, but rather that large values are so 
improbable that the error made in misappraising the penalty associated 
with them is negligible compared to the saving in simplicity resulting 
from the misappraisal. If the assumption of bounded utility were made 
part of the theory of personal probability, then any example in which 
unbounded utility is used for mathematical simplicity would be in 
contradiction to the postulates. I propose, therefore, not to assume bounded 
utility formally, but to remember that problems involving unbounded 
utility are to be handled cautiously. 

† That is, if, for every $V$, there is a consequence $f$ such that $V < U(f)$. This 
manner of speaking is permissible; because in view of Theorem 3.3, if one utility is 
bounded, all are. 

To take stock of the chapter thus far, utility having been established, 
it is now superfluous to consider that consequences may be of all sorts, 
since the postulates imply that in virtually every context a consequence 
is adequately characterized by its utility, some one utility function 
having been chosen from the linear family of possibilities. Therefore, 
unless the contrary is clearly indicated, f, g, and h will henceforth mean 
not exactly consequences in the sense used to date, but rather real 
numbers measuring utility in units to be called utiles. Correspondingly, 
an act f will henceforth be understood to be a real-valued random varia- 
ble. The entire theory of preference, at least for bounded acts, can 
now be summarized by the following résumé: 


R  $\mathbf{f} \le \mathbf{g}$ given $B$, if and only if $P(B) = 0$, or $E(\mathbf{f} - \mathbf{g} \mid B) \le 0$. 


From now on, though not formulated as a postulate, it is to be assumed 
without further quibbling that R holds, provided only that E(f) and 
E(g) exist and are finite; no attempt will be made to compare acts for 
which the expected value does not exist or is infinite. 
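
The résumé R lends itself to a direct computational reading. The following
sketch assumes a finite set of states with given probabilities, which is a
simplification (a finite world cannot itself satisfy P6); the states, the
two acts, and their values in utiles are hypothetical.

```python
from fractions import Fraction

def not_preferred_given(f, g, probability, event):
    """R on a finite state space: f <= g given B iff P(B) = 0 or E(f - g | B) <= 0."""
    p_event = sum(probability[s] for s in event)
    if p_event == 0:
        return True   # given a null event, every act is equivalent to every other
    conditional_difference = sum(probability[s] * (f[s] - g[s]) for s in event) / p_event
    return conditional_difference <= 0

# Hypothetical states, probabilities, and acts measured in utiles.
probability = {"rain": Fraction(3, 10), "shine": Fraction(7, 10)}
carry_umbrella = {"rain": 5, "shine": 3}
leave_umbrella = {"rain": 0, "shine": 6}

everything = {"rain", "shine"}
print(not_preferred_given(carry_umbrella, leave_umbrella, probability, everything))
print(not_preferred_given(carry_umbrella, leave_umbrella, probability, {"rain"}))
```

With these assumed numbers, carrying the umbrella is not preferred overall,
though it is preferred given rain.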

If a person is free to decide among a set F of acts, he will presumably 
choose one the expectation of which is v(F), where 


$$v(\mathbf{F}) = \sup_{\mathbf{f} \in \mathbf{F}} E(\mathbf{f}), \tag{7}$$ 


provided that such a one exists. This provision must be mentioned, 
even though a set F for which v(F) = ∞ will, by convention, not be
considered to give rise to a valid decision problem; for, if F is infinite in 
number, there may be no act in F with expectation quite as great as 
v(F). Nonetheless, v(F) may, in a sense, be regarded as the value or 
utility of the set of acts F, as is discussed in the penultimate paragraph 
of § 6.5. 
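A small numerical sketch may make the proviso concrete. For a finite set of acts the supremum in (7) is a maximum and is attained; for the hypothetical infinite family below, in which the n-th act pays 1 - 1/n utiles in every state, every finite truncation has a best act, yet no act has expectation as great as the supremum 1. The two-state world and all names are assumptions made for illustration.

```python
from fractions import Fraction

def expectation(act, prob):
    """E(act) on a finite world."""
    return sum(act[s] * prob[s] for s in prob)

def value_of_set(acts, prob):
    """v(F) = sup of E(f) over f in F; for a finite F the sup is a max."""
    return max(expectation(f, prob) for f in acts)

prob = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}

finite_F = [{"heads": 0, "tails": 4},
            {"heads": 3, "tails": 3},
            {"heads": 5, "tails": 0}]
v_finite = value_of_set(finite_F, prob)   # attained by the constant act

# Hypothetical infinite family f_n = 1 - 1/n, examined through truncations.
truncated_values = [
    value_of_set([{"heads": 1 - Fraction(1, n), "tails": 1 - Fraction(1, n)}
                  for n in range(1, N + 1)], prob)
    for N in (10, 100, 1000)
]
print(v_finite, truncated_values)
```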


5 Small worlds 


Allusion was made in the penultimate paragraph of § 2.5 to the prac- 
tical necessity of confining attention to, or isolating, relatively simple 
situations in almost all applications of the theory of decision developed
in this book. As was mentioned there, I find it difficult to say with 
any completeness how such isolated situations are actually arrived at 
and justified. The purpose of the present section is to take some steps 
toward the solution of that problem or, at any rate, to set the problem 
forth as clearly as I can. This section, though important for a critical 
evaluation of the thesis of this book, is not essential to a casual reading. 

Making an extreme idealization, which has in principle guided the 
whole argument of this book thus far, a person has only one decision 
to make in his whole life. He must, namely, decide how to live, and 
this he might in principle do once and for all. Though many, like myself,
have found the concept of overall decision stimulating, it is certainly
highly unrealistic and in many contexts unwieldy.† Any claim
to realism made by this book—or indeed by almost any theory of per- 
sonal decision of which I know—is predicated on the idea that some of 
the individual decision situations into which actual people tend to sub- 
divide the single grand decision do recapitulate in microcosm the mech- 
anism of the idealized grand decision. One application of the theory 
of utility to overall decisions has, however, been attempted by Milton 
Friedman in [F11]. 

The problem of this section is to say as clearly as possible what con- 
stitutes a satisfactory isolated decision situation. The general method 
of attack I propose to follow, for want of a better one, is to talk in terms 
of the grand situation—tongue in cheek—and in those terms to analyze 
and discuss isolated decision situations. I hope you will be able to 
agree, as the discussion proceeds, that I do not lean too heavily on the 
concept of the grand decision situation. 

Consider a simple example. Jones is faced with the decision whether 
to buy a certain sedan for a thousand dollars, a certain convertible also 
for a thousand dollars, or to buy neither and continue carless. The 
simplest analysis, and the one generally assumed, is that Jones is deciding
between three definite and sure enjoyments, that of the sedan,
the convertible, or the thousand dollars. Chance and uncertainty are 
considered to have nothing to do with the situation. This simple anal- 
ysis may well be appropriate in some contexts; however, it is not diffi- 
cult to recognize that Jones must in fact take account of many uncer- 
tain future possibilities in actually making his choice. The relative 


† Unrealistic though the concept is, it would be a mistake, arising out of elliptical
presentation, to suppose that the concept predicates the choice of a complete life- 
long policy by new-born babies. If a person ever reached such a level of maturity 
as to be able to make a lifelong choice for his life from that time on, he would then 
become a person to whom the concept could be literally applied. 

fragility of the convertible will be compensated only if Jones’s hope to
arrange a long vacation in a warm and scenic part of the country ac- 
tually materializes; Jones would not buy a car at all if he thought it 
likely that he would immediately be faced by a financial emergency 
arising out of the sickness of himself or of some member of his family; 
he would be glad to put the money into a car, or almost any durable 
goods, if he feared extensive inflation. This brings out the fact that 
what are often thought of as consequences (that is, sure experiences of 
the deciding person) in isolated decision situations typically are in re- 
ality highly uncertain. Indeed, in the final analysis, a consequence is 
an idealization that can perhaps never be well approximated. I there- 
fore suggest that we must expect acts with actually uncertain conse- 
quences to play the role of sure consequences in typical isolated decision 
situations. 

Suppose now, to elaborate the example, that Jones is presented with 
a choice between tickets in several different lotteries such that, which- 
ever he chooses and whatever tickets are drawn, he will win either 
nothing, the sedan, the convertible, or a thousand dollars. None of 
these four consequences—not even “nothing”—is actually a sure consequence
in the strict sense, as I think you will now understand. I
propose to analyze Jones’s present decision situation in terms of a
“small world.” The more colloquial Greek word, microcosm, will be
reserved for a special kind of small world to be described later. To de- 
scribe the state of the small world is to say which prize is associated 
with each of the tickets offered to Jones. The small-world acts actually 
available to Jones are acceptance of one or another of the tickets. 
The generic small-world act is an arbitrary function taking as its value 
one of the four small-world consequences according to which small- 
world state obtains. 

It will be noticed that the small-world states are in fact events in 
the grand world, that indeed they constitute a partition of the grand 
world. If there are an infinite number of small-world states, as indeed 
there must be, if the small world is to satisfy the postulates P1–7, then
the partition in question becomes an infinite partition.† These considerations
lead to the following technical definitions.

Let the grand world S be, as always, a set with elements s, s′, ···.
The grand-world consequences F may as well be taken to be a bounded 


† Technical note: It is mathematically more general and elegant not to insist that
the small world have states at all, but rather to speak of a special class of events as 
small-world events. This class should be closed under complements and finite unions. 
In short, the small-world events, and thereby the small world itself, constitute a 
Boolean subalgebra of the Boolean algebra of the grand-world events. 

set of real numbers. The grand-world acts are then real-valued functions
f, g, h, ···. The preference ordering between acts is determined
by the condition that $f \le g$ if and only if


(1)    $E(f) \le E(g)$,


where the expected value indicated in (1) is derived from a probability 
measure P characteristic of the grand world or, to be more exact, of 
the person’s attitude toward the grand world. 

The construction of a small world $\bar{S}$ from the grand world $S$ begins
with the partition of $S$ into subsets, or small-world states $\bar{s}, \bar{s}', \cdots$ (not
necessarily finite in number). Throughout this technical discussion, it
will be necessary to bear in mind certain double interpretations such
as that $\bar{s}$ is both an element of $\bar{S}$ and a subset of $S$. Strictly speaking, a
small-world event $\bar{B}$ in $\bar{S}$ is a collection of subsets of $S$ and not itself a
subset of $S$. However, the union of all the elements of $\bar{B}$, regarded as
subsets of $S$, is an event in $S$; call it $[\bar{B}]$.

The small world, as I mean to define it, is determined not only by 
the definition of a state, but also by the definition of small-world consequences.
A small-world consequence is a grand-world act. A set $\bar{F}$ of
grand-world acts, regarded as small-world consequences, is thus part of
the definition of any given small world. It will be mathematically
simplest, and cost little if anything in insight, to suppose that the elements
of $\bar{F}$ are finite in number. They will be denoted $\bar{f}, \bar{g}, \bar{h}, \cdots$;
and, when the small-world consequence $\bar{f}$ is recognized as a grand-world
act, $\bar{f}(s)$ will denote the grand-world consequence of $\bar{f}$ at the grand-world
state $s$.

A small-world act $\bar{\mathbf{f}}$ is, of course, a function from small-world states $\bar{s}$
to small-world consequences $\bar{f}$. In this isolated technical discussion, we
will hobble along with the notations $\bar{\mathbf{f}}(\bar{s})$ for the small-world consequence
attached to $\bar{s}$ by $\bar{\mathbf{f}}$, and $\bar{\mathbf{f}}(s; \bar{s})$ for the grand-world consequence
attached to $s$ by $\bar{\mathbf{f}}(\bar{s})$ recognized as a grand-world act. Each small-world
act $\bar{\mathbf{f}}$ gives rise to a unique grand-world act $\hat{f}$, defined thus:


(2)    $\hat{f}(s) =_{Df} \bar{\mathbf{f}}(s; \bar{s}(s))$,


where $\bar{s}(s)$ means that small-world state $\bar{s}$ of which the grand-world
state $s$ is an element.

The distinction between $\bar{\mathbf{f}}$ and $\hat{f}$, like some other distinctions I have
thought it worth while to make in the present complicated context, is
perhaps pedantic. At any rate, it is to be understood as part of the
definition of a small world that $\bar{\mathbf{f}} \le \bar{\mathbf{g}}$ if and only if $\hat{f} \le \hat{g}$, that is, in
view of (1), if and only if $E(\hat{f}) \le E(\hat{g})$. In this connection, it is useful
to note that


(3)    $E(\hat{f}) = \sum_{\bar{k} \in \bar{F}} E(\hat{f} \mid \bar{\mathbf{f}}(\bar{s}(s)) = \bar{k}) \, P(\bar{\mathbf{f}}(\bar{s}(s)) = \bar{k})$
              $= \sum_{\bar{k}} E(\bar{k} \mid \bar{\mathbf{f}}(\bar{s}(s)) = \bar{k}) \, P(\bar{\mathbf{f}}(\bar{s}(s)) = \bar{k})$.


It may be advantageous to review (3), and thereby the whole techni- 
cal definition of a small world, in terms of an example. A small-world 
act, typified by the purchase of a lottery ticket, amounts to accepting 
the consequences of one of several ordinary grand-world acts according 
to which element of a partition does in fact obtain. For example, the 
participant in a lottery may drive away a car, lead away a goat, face 
a firing squad, or remain in the status quo, according to the terms of 
the lottery and according to which ticket he has in fact drawn. Letting 
the example of the lottery stand for the general situation, the expected 
utility of a lottery ticket can be computed by the partition formula 
(3.5.3) from the conditional expectation associated with each ticket, 
which is what (3) does. 
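The bookkeeping behind (2) and (3) can be traced on a toy lottery. What follows is a minimal sketch, assuming a hypothetical six-state grand world, three tickets, and two prizes whose utilities themselves vary with the grand-world state; none of these numbers or names comes from the text.

```python
from fractions import Fraction

# Hypothetical grand world: six states with personal probabilities.
P = {0: Fraction(1, 12), 1: Fraction(2, 12), 2: Fraction(3, 12),
     3: Fraction(1, 12), 4: Fraction(2, 12), 5: Fraction(3, 12)}

# Small-world states: a partition of the grand world, one cell per ticket.
partition = {"ticket 1": {0, 1}, "ticket 2": {2, 3}, "ticket 3": {4, 5}}

# Small-world consequences are grand-world acts: utiles as a function of s.
consequences = {
    "car":     {0: 10, 1: 6, 2: 8, 3: 8, 4: 9, 5: 7},
    "nothing": {0: 1, 1: 0, 2: 0, 3: 2, 4: 0, 5: 1},
}

# A small-world act assigns a consequence to each small-world state.
small_act = {"ticket 1": "car", "ticket 2": "nothing", "ticket 3": "car"}

def small_state_of(s):
    """The cell of the partition that contains the grand-world state s."""
    return next(name for name, cell in partition.items() if s in cell)

def induced_act(s):
    """Equation (2): the grand-world act to which the small-world act gives rise."""
    return consequences[small_act[small_state_of(s)]][s]

# Expectation of the induced act computed directly.
direct = sum(P[s] * induced_act(s) for s in P)

# Equation (3): sum over k of E(k | act yields k) * P(act yields k).
by_partition = Fraction(0)
for k, k_act in consequences.items():
    event = [s for s in P if small_act[small_state_of(s)] == k]
    p_event = sum(P[s] for s in event)
    if p_event > 0:
        cond_exp = sum(P[s] * k_act[s] for s in event) / p_event
        by_partition += cond_exp * p_event

print(direct, by_partition)
```

Both computations give the same expected utility, as the partition formula requires.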

It may fairly be said that a lottery prize is not an act, but rather the 
opportunity to choose from a number of acts. Thus a cash prize puts 
its possessor in a position to choose among many purchases he could 
not otherwise afford. I believe that analysis to be more nearly correct, 
but it is more complicated; and, if one thinks of each set of acts made 
available by a lottery prize as represented by a best act of that set, 
the more complicated analysis seems superfluous, at least in a first 
attack. 

A small world is completely satisfactory for the use to which I mean 
to put it, if and only if it itself satisfies the seven postulates and leads 
to—more technically, agrees with—a probability $\bar{P}$ such that


(4)    $\bar{P}(\bar{B}) = P([\bar{B}])$

for all $\bar{B} \subset \bar{S}$ and has a utility $\bar{U}$ such that

(5)    $\bar{U}(\bar{f}) = E(\bar{f})$


for all $\bar{f} \in \bar{F}$. For the present context, call such a completely satisfactory
small world a microcosm; if the small world satisfies the postulates,
but does not necessarily admit $\bar{P}$ as its probability nor $\bar{U}$ as a utility,
call it a pseudo-microcosm.

To display the circumstances under which a small world is a pseudo- 
microcosm, I shall briefly comment on each of the postulates in the 
form given on the end papers of this book, referring to them here as
$\bar{\mathrm{P}}$1–7, as opposed to P1–7, to emphasize that they are here being considered
with respect to $\bar{S}$ and $\bar{F}$.


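$\bar{\mathrm{P}}$1    Simple ordering.
       Automatically satisfied. Indeed it is directly implied by P1.
$\bar{\mathrm{P}}$2    Conditional preference well defined.
       Automatic.
$\bar{\mathrm{P}}$3    Conditional preference does not affect consequences.
       Requires exactly that, for every $\bar{f}, \bar{g} \in \bar{F}$, and $\bar{B} \subset \bar{S}$, either:
       a. $\bar{f} \le \bar{g}$ given $[\bar{B}]$, if and only if $\bar{f} \le \bar{g}$, or
       b. $\bar{h} \le \bar{k}$ given $[\bar{B}]$, for every $\bar{h}, \bar{k} \in \bar{F}$.


In these inequalities the elements of $\bar{F}$ are of course interpreted as
grand-world acts.


$\bar{\mathrm{P}}$4    Qualitative personal probability well defined.
       Requires exactly that, if $\bar{f} < \bar{g}$ and $h_{\bar{B}} \le h_{\bar{C}}$, where


(6)    $h_{\bar{B}}(s) = \begin{cases} \bar{g}(s) & \text{for } s \in [\bar{B}], \\ \bar{f}(s) & \text{for } s \in \sim[\bar{B}]; \end{cases}$
       $h_{\bar{C}}(s) = \begin{cases} \bar{g}(s) & \text{for } s \in [\bar{C}], \\ \bar{f}(s) & \text{for } s \in \sim[\bar{C}]; \end{cases}$


       then $h'_{\bar{B}} \le h'_{\bar{C}}$, where $h'_{\bar{B}}$ and $h'_{\bar{C}}$ are defined in terms of $\bar{f}', \bar{g}'$, $\bar{f}' < \bar{g}'$,
       in analogy with (6).
       This postulate is automatic in case $\bar{F}$ has at most two elements.
$\bar{\mathrm{P}}$5    The person has some definite preference.
       Requires $\bar{f} < \bar{g}$ for some $\bar{f}, \bar{g} \in \bar{F}$.
$\bar{\mathrm{P}}$6    Partition of worlds into tiny events.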


It is clear that this postulate is not automatic, that is, it is not implied
by the validity of P1–7 for the grand world. It is not even implied
by P1–7 together with $\bar{\mathrm{P}}$1–5, though in the presence of all these
$\bar{\mathrm{P}}$6 could undoubtedly be weakened. There seems to be little to gain
in the present context by reducing $\bar{\mathrm{P}}$6 to such minimal terms, nor by
expressing it, as $\bar{\mathrm{P}}$1–5 have been expressed, in grand-world terms alone;
for $\bar{\mathrm{P}}$6 does not lend itself easily to such treatment, though it would be
easy to decide in any instance whether $\bar{\mathrm{P}}$6 obtained without undue
reference to the grand world.


$\bar{\mathrm{P}}$7    Strong form of sure-thing principle.


Automatic, in view of the explicit assumption that $\bar{F}$ has only a
finite number of elements.

To summarize, a small world is a pseudo-microcosm, if and only if 
it satisfies $\bar{\mathrm{P}}$3–6. The possibility of enlarging an arbitrary small world
in such a way as to satisfy those conditions has already been implicitly
discussed in connection with P3–6. To recall the arguments that were
adduced, one might review the example about the egg in § 3.1, and
the further discussion of that example in the opening paragraph of
§ 3.2; the remark in § 3.2, introducing P5; and the example about the
coin following P6′ in § 3.3.

It is encouraging to possess the arguments just cited tending to show
that any small world can without overwhelming difficulty be embedded 
in a somewhat larger small world that is a pseudo-microcosm. A pseudo- 
microcosm is, however, completely satisfactory, only if it is actually a 
microcosm, that is, only if it leads to a probability measure and a 
utility well articulated with those of the grand world. The problem of 
deciding under what circumstances that occurs is much facilitated by 
the fact that the probability measure and a utility of a pseudo-micro- 
cosm can be written down explicitly, as the next few paragraphs show. 

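To study the problem, suppose the small world is a pseudo-microcosm.
Then, in view of $\bar{\mathrm{P}}$5, let $\bar{g}, \bar{h}$ be elements of $\bar{F}$ such that $\bar{g} < \bar{h}$,
and let


(7)    $\bar{Q}(\bar{B}) =_{Df} \dfrac{E(\bar{h} - \bar{g} \mid [\bar{B}])}{E(\bar{h} - \bar{g})} \, P([\bar{B}])$
              $= \dfrac{1}{E(\bar{h} - \bar{g})} \int_{[\bar{B}]} \{\bar{h}(s) - \bar{g}(s)\} \, dP(s)$.


By using $\bar{\mathrm{P}}$3 to check the positivity, it is easily verified that $\bar{Q}$ is a probability
measure on $\bar{S}$. The probability measure $\bar{Q}$ agrees with the relation
$\le$ between small-world events, which is easily verified on rewriting
(3) for the special small-world act $\bar{\mathbf{f}}_{\bar{B}}$ that takes the value $\bar{h}$
for $\bar{s} \in \bar{B}$ and $\bar{g}$ for $\bar{s} \in \sim\bar{B}$ thus:


(8)    $E(\hat{f}_{\bar{B}}) = E(\bar{h} \mid [\bar{B}]) \, P([\bar{B}]) + E(\bar{g} \mid \sim[\bar{B}]) \, P(\sim[\bar{B}])$
              $= E(\bar{h} - \bar{g} \mid [\bar{B}]) \, P([\bar{B}]) + E(\bar{g})$
              $= E(\bar{h} - \bar{g}) \, \bar{Q}(\bar{B}) + E(\bar{g})$.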
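
Since $\bar{g}$ and $\bar{h}$ are essentially arbitrary, there are many ways to construct
a probability measure that agrees with the relation $\le$ between
small-world events, but, in the presence of $\bar{\mathrm{P}}$1–6, all of them must (in
view of Corollary 3.3.1) be the same as $\bar{Q}$. That consideration leads to
the formula


(9)    $E(\bar{f} - \bar{f}' \mid [\bar{B}]) \, P([\bar{B}]) = E(\bar{f} - \bar{f}') \, \bar{Q}(\bar{B})$


for all $\bar{f}, \bar{f}' \in \bar{F}$ and $\bar{B} \subset \bar{S}$.

Using (9) and recalling that $\bar{U}(\bar{f})$ has been defined as $E(\bar{f})$, (3) can
be rewritten thus:


(10)   $E(\hat{f}) = E(\bar{g}) + \sum_{\bar{k}} E(\bar{k} - \bar{g} \mid \bar{\mathbf{f}}(\bar{s}(s)) = \bar{k}) \, P(\bar{\mathbf{f}}(\bar{s}(s)) = \bar{k})$
              $= \sum_{\bar{k}} \bar{U}(\bar{k}) \, \bar{Q}(\bar{\mathbf{f}}(\bar{s}) = \bar{k})$.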


The question whether a given pseudo-microcosm is really a microcosm
is the question whether $\bar{Q}(\bar{B}) = P([\bar{B}])$ and whether $\bar{U}$ is a utility
for the pseudo-microcosm. The answer to the second part is immediate
and, I think, somewhat surprising, for (10) shows that for any pseudo-microcosm
$\bar{U}$ is indeed a utility.

Unfortunately, the condition $\bar{Q}(\bar{B}) = P([\bar{B}])$ is not also automatic.
The possibility of its failing to be satisfied is illustrated by the following
simple mathematical example. Let $S$ be the unit square $0 \le x, y \le 1$,
and let


(11)   $E(f) = \int_0^1 \int_0^1 f(x, y) \, dx \, dy$.


It is of no real moment that the integral in (11), if understood in the
Lebesgue or Riemann sense, is not defined for all bounded functions.
Let the elements of $\bar{S}$ be the vertical line segments, $x$ = constant.
Finally, suppose that the elements of $\bar{F}$ consist of the function zero and
any finite number of non-negative multiples of a fixed positive function
$\bar{h} = h$. It is easy to verify that $\bar{S}$ as thus defined is a pseudo-microcosm
and that


(12)   $\bar{Q}(\bar{B}) = \int_{\bar{B}} q(x') \, dx'$,


where


(13)   $q(x') = \dfrac{\int_0^1 h(x', y) \, dy}{\int_0^1 \int_0^1 h(x, y) \, dx \, dy}$.


Unless $q$ is 1 for every $x'$, which will not at all typically be the case, $\bar{S}$
is not really a microcosm. 
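The failure can be seen numerically for one particular prize function. The sketch below assumes, purely for illustration, h(x, y) = 1 + x and approximates the integrals in (12) and (13) by the midpoint rule; the function names are hypothetical. For this h the density is q(x') = (1 + x')/1.5, so the small-world probability of the left half of the square is 5/12, while the grand-world probability of the same event is 1/2.

```python
def h(x, y):
    # Hypothetical positive function playing the role of the fixed prize h.
    return 1.0 + x

def midpoint_integral(func, a, b, n=200):
    """Midpoint-rule approximation to the integral of func over [a, b]."""
    width = (b - a) / n
    return sum(func(a + (i + 0.5) * width) for i in range(n)) * width

def inner(x):
    """Integral of h(x, y) over 0 <= y <= 1, the numerator of (13)."""
    return midpoint_integral(lambda y: h(x, y), 0.0, 1.0)

TOTAL = midpoint_integral(inner, 0.0, 1.0)   # the denominator of (13)

def q(x):
    """The density (13)."""
    return inner(x) / TOTAL

def Q_bar(a, b):
    """Equation (12) for the strip of vertical segments with a <= x <= b."""
    return midpoint_integral(q, a, b)

def P_strip(a, b):
    """Grand-world probability of the same strip under (11)."""
    return b - a

print(Q_bar(0.0, 0.5), P_strip(0.0, 0.5))
```

Replacing `h` by a function that does not depend on x makes q identically 1, and the two probabilities then coincide.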

The general condition that a pseudo-microcosm be a microcosm—i.e.,
that $\bar{Q}(\bar{B}) = P([\bar{B}])$—is evidently, in view of (9),

(14)   $E(\bar{f} - \bar{f}' \mid [\bar{B}]) = E(\bar{f} - \bar{f}')$

for every $\bar{f}, \bar{f}' \in \bar{F}$ and every $\bar{B}$ for which $P([\bar{B}]) > 0$. Incidentally,
that condition alone practically implies that a small world $\bar{S}$, not necessarily
assumed to be a pseudo-microcosm, is a real microcosm. More
exactly, it implies all the postulates $\bar{\mathrm{P}}$1–7, except $\bar{\mathrm{P}}$6; and it implies
that the probability measure $\bar{P}$ agrees with the relation $\le$ between
small-world events. Also, if a small world is a pseudo-microcosm, it is
enough that (14) should hold for some pair of functions for which the
right-hand side of the equation does not vanish.

Equation (14) is, however, unsatisfactory in that it seems incapable 
of verification without taking the grand world much too seriously. 
Some consolation may derive from the fact that if $\bar{f}$ and $\bar{f}'$ are constants
they automatically satisfy (14). Two such absolute, or grand-world,
consequences would suffice, for, as has just been remarked, it is sufficient
that (14) be satisfied for two materially different small-world
consequences, in the presence of $\bar{\mathrm{P}}$1–7 (which are verifiable without
any detailed knowledge of the grand world). It must, however, be ad- 
mitted, as has already been mentioned, that the very idea of a grand- 
world consequence takes the grand world pretty seriously—a point 
forced into my reluctant mind by a conversation with Francesco Bram- 
billa. 

I feel, if I may be allowed to say so, that the possibility of being taken 
in by a pseudo-microcosm that is not a real microcosm is remote, but 
the difficulty I find in defining an operationally applicable criterion is, 
to say the least, ground for caution. 

There certainly seem to be cases in which one could confidently as- 
sume (14), though thus far formal analysis of the source of such se- 
curity escapes me. Consider, for example, a lottery in which numbered 
tickets are drawn from a drum. It seems clear that for an ordinary 
person the outcome of the lottery is utterly irrelevant to his life, except 
through the rules of the lottery itself. In other terms equally loose, 
the value of a thousand dollars, or of a car, to a person would not ordi- 
narily depend at all on what numbers were drawn in a lottery, unless 
the person himself (or perhaps some other person or organization with 
whom he had some degree of contact) held tickets in the lottery. A 
more precise formulation, which does indeed imply (14), is that the 
events that represent the outcome of the lottery are all statistically 
independent of the grand-world acts, or functions, that typically enter
as prizes in a lottery. This suggests once more that it would be desir- 
able, if possible, to find a simple qualitative personal description of in- 
dependence between events. (Compare the first paragraph after 
(3.5.2).) 


6 Historical and critical comments on utility 


A casual historical sketch of the concept of utility will perhaps have 
some interest as history. At any rate, most of the critical ideas per- 
taining to utility that I wish to discuss find their places in such a sketch 
as conveniently as in any other organization I can devise. Much more 
detailed material on the history of utility, especially in so far as the 
economics of risk bearing is concerned, is to be found in Arrow’s review 
article [A6]. Stigler’s historical study [S18] emphasizes the history of 
the now almost obsolete economic notion of utility in riskless situations, 
a notion still sometimes confused with the one under discussion. 

As was mentioned in § 4.5, the earliest mathematical studies of prob- 
ability were largely concerned with gambling, particularly with the 
question of which of several available cash gambles is most advanta- 
geous. Early probabilists advanced the maxim that the gamble with 
the highest expected winnings is best or, in terms of utility, that wealth 
measured in cash is a utility function. Some sense can be seen in that 
maxim, which will here be called by its traditional though misleading 
name, the principle of mathematical expectation. First, it has often been 
argued that the principle follows for the long run from the weak law of 
large numbers, applied to large numbers of independent bets, in each 
of which only sums that the gambler considers small are to be won or 
lost. Second, Daniel Bernoulli, who, in [B10], was one of the first to 
introduce a general idea of utility corresponding to that developed in 
the preceding three sections, made the following analysis of the princi- 
ple, which justifies its application in limited but important contexts. 
If the consequences f to be considered are all quantities of cash, it is 
reasonable to suppose that U(f) will change smoothly with changes in 
f. Therefore, if a person’s present wealth is $f_0$, and he contemplates
various gambles, none of which can greatly change his wealth, the
utility function can, for his particular purpose, be approximated by its
tangent at $f_0$, that is,


(1)    $U(f) \approx U(f_0) + (f - f_0) U'(f_0)$,


a linear function of f. Since a constant term is irrelevant to any com- 
parison of expected values, the approximation amounts to regarding 
utility as proportional to wealth, that is, to following the principle of 
mathematical expectation. So far as I know, the only other argument
for the principle that has ever been advanced is one concerning equity 
between two players. As Bernoulli says, that argument is irrelevant at 
best; and neither of the relevant arguments justifies categorical acceptance
of the principle. None the less, the principle was at first so categorically
accepted that it seemed paradoxical to mathematicians of the
early eighteenth century that presumably prudent individuals reject 
the principle in certain real and hypothetical decision situations. 
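The reach of the tangent approximation (1) can be gauged numerically. The sketch below assumes, only for the sake of illustration, a logarithmic utility of wealth and a present wealth of 10,000; neither choice is asserted by the argument above, and the name `tangent_utility` is hypothetical.

```python
import math

def tangent_utility(U, dU, f0):
    """Linear approximation (1): U(f0) + (f - f0) * U'(f0)."""
    return lambda f: U(f0) + (f - f0) * dU(f0)

f0 = 10_000.0                 # assumed present wealth
U = math.log                  # assumed utility, for illustration only
dU = lambda f: 1.0 / f        # derivative of the logarithm
U_lin = tangent_utility(U, dU, f0)

# Error of the approximation for a small and for a large change in wealth.
small_error = abs(U(f0 + 10) - U_lin(f0 + 10))
large_error = abs(U(f0 + 9_000) - U_lin(f0 + 9_000))
print(small_error, large_error)
```

For a stake of 10 the error is below one part in a million of a utility unit, whereas for a stake of 9,000 it exceeds a fifth of a unit, which is the sense in which the principle is justified only for gambles that cannot greatly change the person's wealth.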

Daniel Bernoulli (1700-1782), in the paper [B10], seems to have 
been the first to point out that the principle is at best a rule of thumb, 
and he there suggested the maximization of expected utility as a more 
valid principle. Daniel Bernoulli’s paper reproduces portions of a let- 
ter from Gabriel Cramer to Nicholas Bernoulli, which establishes 
Cramer’s chronological priority to the idea of utility and most of the 
other main ideas of Bernoulli’s paper. But it is Bernoulli’s formulation 
together with some of the ideas that were specifically his that became 
popular and have had widespread influence to the present day. It is 
therefore appropriate to review Bernoulli’s paper in some detail. 

Being unable to read Latin, I follow the German edition [B11]. 

Bernoulli begins by reminding his readers that the principle of mathe- 
matical expectation, though but weakly supported, had theretofore 
dominated the theory of behavior in the face of uncertainty. He says 
that, though many arguments had been given for the principle, they 
were all based on the irrelevant idea of equity among players. It seems 
hard to believe that he had never heard the argument justifying the 
principle for the long run, even though the weak law of large numbers 
was then only in its mathematical infancy. Ars Conjectandi [B12], then 
a fairly up-to-date and most eminent treatise on probability, does seem 
to give only the argument about equity, and that in countless forms. 
This treatise by Daniel’s uncle, Jacob (= James) Bernoulli (1654-1705), 
incidentally, contains the first mathematical advance toward the weak 
law, proving it for the special case of repeated trials. 

Many examples show that the principle of mathematical expecta- 
tion is not universally applicable. Daniel Bernoulli promptly presents 
one: “To justify these remarks, let us suppose a pauper happens to acquire
a lottery ticket by which he may with equal probability win
either nothing or 20,000 ducats. Will he have to evaluate the worth
of the ticket as 10,000 ducats; and would he be acting foolishly, if he
sold it for 9,000 ducats?”

Other examples occur later in the paper as illustrations of the use 
of the utility concept. Thus a prudent merchant may insure his ship 
against loss at sea, though he understands perfectly well that he is 
thereby increasing the insurance company’s expected wealth, and to
the same extent decreasing his own. Such behavior is in flagrant vio- 
lation of the principle of mathematical expectation, and to one who held 
that principle categorically it would be as absurd to insure as to throw 
money away outright. But the principle is neither obvious nor de- 
duced from other principles regarded as obvious; so it may be challenged, 
and must be, because everyone agrees that it is not really insane to 
insure. 

Bernoulli cites a third, now very famous, example illustrating that 
men of prudence do not invariably obey the principle of mathematical 
expectation. This example, known as the St. Petersburg paradox (because
of the journal in which Bernoulli’s paper was published) had earlier
been publicized by Nicholas Bernoulli,† and Daniel acknowledges
it as the stimulus that led to his investigation of utility. Suppose, to 
state the St. Petersburg paradox succinctly, that a person could choose 
between an act leaving his wealth fixed at its present magnitude or one 
that would change his wealth at random, increasing it by $(2^n - f)$ dollars
with probability $2^{-n}$ for every positive integer $n$. No matter how
large the admission fee f may be, the expected income of the random 
act is infinite, as may easily be verified. Therefore, according to the 
principle of mathematical expectation, the random act is to be pre- 
ferred to the status quo. Numerical examples, however, soon convince 
any sincere person that he would prefer the status quo if f is at all 
large. If f is $128, for example, there is only 1 chance in 64 that a 
person choosing the random act will so much as break even, and he 
will otherwise lose at least $64, a jeopardy for which he can seek com- 
pensation only in the prodigiously improbable winning of a prodigiously 
high prize. 
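The figures quoted for a fee of $128 can be checked directly. The following is a minimal sketch; the function names are hypothetical, and the helper `smallest_loss` assumes a fee greater than 2.

```python
from fractions import Fraction

def first_winning_n(fee):
    """Smallest n for which the prize 2**n is at least the fee."""
    n = 1
    while 2 ** n < fee:
        n += 1
    return n

def prob_break_even(fee):
    """P(prize >= fee): the tail sum of 2**-k for k >= n equals 2**-(n - 1)."""
    return Fraction(1, 2 ** (first_winning_n(fee) - 1))

def smallest_loss(fee):
    """Loss when the largest prize below the fee is drawn (assumes fee > 2)."""
    return fee - 2 ** (first_winning_n(fee) - 1)

def truncated_expected_prize(max_n):
    """Partial sum of prize * probability for n = 1..max_n; each term is 1."""
    return sum(Fraction(2 ** n) * Fraction(1, 2 ** n)
               for n in range(1, max_n + 1))

print(prob_break_even(128), smallest_loss(128), truncated_expected_prize(40))
```

Since every term of the expected prize contributes exactly 1, the partial sums grow without bound, which is why no finite fee makes the expected income finite.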

Appealing to intuition, Bernoulli says that the cash value of a per- 
son’s wealth is not its true, or moral, worth to him. Thus, according to 
Bernoulli, the dollar that might be precious to a pauper would be nearly 
worthless to a millionaire—or, better, to the pauper himself were he to 
become a millionaire. Bernoulli then postulates that people do seek 
to maximize the expected value of moral worth, or what has been called 
moral expectation. 

Operationally, the moral worth of a person’s wealth, so far as it con- 
cerns behavior in the face of uncertainty, is just what I would call the 
utility of the wealth, and moral expectation is expectation of utility. 


† Daniel refers to this Nicholas Bernoulli as his uncle, but, in view of dates mentioned
in the last section of Daniel’s paper and the genealogy in Chapter 8 of [B9],
I think he must have meant his elder cousin (1687-1759), perhaps using “uncle” as
a term of deference.

It seems mystical, however, to talk about moral worth apart from
probability and, having done so, doubly mystical to postulate that this 
undefined quantity serves as a utility. These obvious criticisms have 
naturally led many to discredit the very idea of utility, but §§ 2-4 
show (following von Neumann and Morgenstern) that there is a more 
cogent, though not altogether unobjectionable, path to that concept.

Bernoulli argued, elaborating the example of the pauper and the
millionaire, that a fixed increment of cash wealth typically results in 
an ever smaller increment of moral wealth as the basic cash wealth to 
which the increment applies is increased. He admitted the possibility 
of examples in which this law of diminishing marginal utility, as it has 
come to be called in the literature of economics, might fail. For ex- 
ample, a relatively small sum might be precious to a wealthy prisoner 
who required it to complete his ransom. But Bernoulli insisted that 
such examples are unusual and that as a general rule the law may be 
assumed. In mathematical terms, the law says that utility as a function
of money is a concave (i.e., the negative of a convex) function.†
It follows from the basic inequality concerning convex functions (Theo- 
rem 1 of Appendix 2) that a person to whom the law of diminishing 
marginal utility applies will always prefer the status quo to any fair 
gamble, that is, to any random act for which the change in his expected 
wealth is zero, and that he will always be willing to pay something in 
addition to its actuarial, or expected, value for insurance against any 
loss to himself. The law of diminishing marginal utility has been very
popular, and few who have considered utility since Bernoulli have dis- 
carded it, or even realized that it was not necessarily part and parcel 
of the utility idea. Of course, the law has been embraced eagerly and 
uncritically by those who have a moral aversion to gambling. 
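
Both consequences of concavity can be illustrated with a single concave utility. The sketch below assumes, for illustration only, that utility is the square root of wealth and that present wealth is 10,000; the gambles and the function names are hypothetical.

```python
import math

def expected_utility(utility, wealth, gamble):
    """gamble is a list of (change_in_wealth, probability) pairs."""
    return sum(p * utility(wealth + dx) for dx, p in gamble)

def certainty_equivalent_sqrt(wealth, gamble):
    """Sure wealth whose square-root utility equals the gamble's expected utility."""
    return expected_utility(math.sqrt, wealth, gamble) ** 2

wealth = 10_000.0
fair_gamble = [(+3_600.0, 0.5), (-3_600.0, 0.5)]   # expected change is zero
loss_risk = [(-6_400.0, 0.1), (0.0, 0.9)]          # expected loss is 640

eu_gamble = expected_utility(math.sqrt, wealth, fair_gamble)
eu_status_quo = math.sqrt(wealth)

# Largest premium the person would pay to be rid of the risk of loss.
max_premium = wealth - certainty_equivalent_sqrt(wealth, loss_risk)
print(eu_gamble, eu_status_quo, max_premium)
```

The fair gamble has lower expected utility than the status quo, and the largest acceptable premium, 784, exceeds the actuarial value of the loss, 640.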
Bernoulli went further than the law of diminishing marginal utility 
and suggested that the slope of utility as a function of wealth might, 
at least as a rule of thumb, be supposed, not only to decrease with, but 
to be inversely proportional to, the cash value of wealth. This, he 
pointed out, is equivalent to postulating that utility is equal to the 
logarithm (to any base) of the cash value of wealth. To this day, no 
other function has been suggested as a better prototype for Everyman’s 
utility function. None the less, as Cramer pointed out in his aforemen- 
tioned letter, the logarithm has a serious disadvantage; for, if the loga- 
rithm were the utility of wealth, the St. Petersburg paradox could be 


† Often the meanings of “convex” and “concave” as applied to functions are interchanged.
A function is here called convex if it appears convex, in the ordinary
sense of the word, when viewed from below. Such a function is, of course, also con- 
cave from above, whence the confusion. Cf. Appendix 2. 

amended to produce a random act with an infinite expected utility
(i.e., an infinite expected logarithm of income) that, again, no one would 
really prefer to the status quo. To take a less elaborate example, sup- 
pose that a man’s total wealth, including an appraisal of his future 
earning power, were a million dollars. If the logarithm of wealth were 
actually his utility, he would as soon as not flip a coin to decide whether 
his wealth should be changed to ten thousand dollars—roughly $500 
per year—or a hundred million dollars. This seems preposterous to 
me. At any rate, I am sure you can construct an example along the 
same lines that will seem preposterous to you. Cramer therefore con- 
cluded, and I think rightly, that the utility of cash must be bounded, 
at least from above. It seems to me that a good argument can also be 
adduced for supposing utility to be bounded from below, for, however 
wealth may be interpreted, we all subject our total wealth to slight 
jeopardy daily for the sake of a large probability of avoiding more 
moderate losses. But the logarithm is unbounded both from above 
and from below; so, though it might be a reasonable approximation to 
a person’s utility in a moderate range of wealth, it cannot be taken 
seriously over extreme ranges. 
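
The arithmetic behind the million-dollar example can be checked in a few lines. The sketch below assumes, purely for illustration, that utility is the base-10 logarithm of wealth (any base gives the same comparison); the function name `expected_log_utility` is introduced here and is not part of the text.

```python
import math

def expected_log_utility(outcomes):
    """Expected base-10 log of wealth for a list of (probability, wealth) pairs."""
    return sum(p * math.log10(w) for p, w in outcomes)

# Status quo: a sure total wealth of one million dollars.
status_quo = [(1.0, 1_000_000)]

# Coin flip between ten thousand dollars and a hundred million dollars.
coin_flip = [(0.5, 10_000), (0.5, 100_000_000)]

print(expected_log_utility(status_quo))
print(expected_log_utility(coin_flip))
```

Both expectations equal 6, because ten thousand and a hundred million lie two powers of ten below and above a million. Under logarithmic utility the man would therefore be indifferent between the coin flip and the status quo.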

Bernoulli’s ideas were accepted wholeheartedly by Laplace [L1], who 
was very enthusiastic about the applications of probability to all sorts 
of decision problems. It is my casual impression, however, that from 
the time of Laplace until quite recently the idea of utility did not 
strongly influence either mathematical or practical probabilists. 

For a long period economists accepted Bernoulli’s idea of moral 
wealth as the measurement of a person’s well-being apart from any 
consideration of probability. Though “utility” rather than “moral 
worth” has been the popular name for this concept among English-speaking 
economists, it is my impression that Bernoulli’s paper is the 
principal, if not the sole, source of the notion for all economists, though 
the paper itself may often have been lost sight of. Economists were for 
a time enthusiastic about the principle of diminishing marginal utility, 
and they saw what they believed to be reflections of it in many aspects 
of everyday life. Why else, to paraphrase Alfred Marshall (pp. 19, 
95 of [M2]), does a poor man walk in a rain that induces a rich man to 
take a cab? 

During the period when the probability-less idea of utility was popu- 
lar with economists, they referred not only to the utility of money, 
but also to the utility of other consequences such as commodities (and 
services) and combinations (or, better, patterns of consumption) of com- 
modities. The theory of choice among consequences was expressed by 
the idea that, among the available consequences, a person prefers those 
that have the highest utility for him. Also, the idea of diminishing 
marginal utility was extended from money to other commodities. 

The probability-less idea of utility in economics has been completely 
discredited in the eyes of almost all economists, the following argument 
against it (originally advanced by Pareto in pp. 158-159 and the 
Mathematical Appendix of [P1]) being widely accepted. If utility is 
regarded as controlling only consequences, rather than acts, it is not 
true (as it is when acts, or at least gambles, are considered and the 
formal definition in § 3 is applied) that utility is determined except 
for a linear transformation. Indeed, confining attention to 
consequences, any strictly monotonically increasing function of one utility 
is another utility. Under these circumstances there is little, if any, 
value in talking about utility at all, unless, of course, special economic 
considerations should render one utility, or say a linear family of utili- 
ties, of particular interest. That possibility remains academic to date, 
though one attempt to exploit it was made by Irving Fisher, as is briefly 
discussed in the paragraph leading to Footnote 155 of [S18]. In par- 
ticular, utility as a function of wealth can have any shape whatsoever 
in the probability-less context, provided only that the function in ques- 
tion is increasing with increasing wealth, the provision following from 
the casual observation that almost nobody throws money away. The 
history of probability-less utility has been thoroughly reported by Stig- 
ler [S18]. 

What, then, becomes of the intuitive arguments that led to the no- 
tion of diminishing marginal utility? To illustrate, consider the poor 
man and the rich man in the rain. Those of us who consider diminish- 
ing marginal utility nonsensical in this context think it sufficient to 
say simply that it is a common observation that rich men spend money 
freely to avoid moderate physical suffering whereas poor men suffer 
freely rather than make corresponding expenditures of money; in other 
terms, that the rate of exchange between circumstances producing phys- 
ical discomfort and money depends on the wealth of the person involved. 

In recent years there has been revived interest in Bernoulli’s ideas 
of utility in the technical sense of §§ 2-4, that is, as a function that, so 
to speak, controls decisions among acts, or at least gambles. Ramsey’s 
essays in [R1], which in spirit closely resemble the first five chapters of 
this book, present a relatively early example of this revival of interest. 
Ramsey improves on Bernoulli in that he defines utility operationally 
in terms of the behavior of a person constrained by certain postulates. 
Ramsey’s essays, though now much appreciated, seem to have had 
relatively little influence. 

Between the time of Ramsey and that of von Neumann and Morgenstern 
there was interest in breaking away from the idea of maximizing 
expected utility, at least so far as economic theory was concerned (cf. 
[T1a]). This trend was supported by those who said that Bernoulli gives 
no reason for supposing that preferences correspond to the expected 
value of some function, and that therefore much more general possi- 
bilities must be considered. Why should not the range, the variance, 
and the skewness, not to mention countless other features, of the dis- 
tribution of some function join with the expected value in determining 
preference? The question was answered by the construction of Ramsey 
and again by that of von Neumann and Morgenstern, which has been 
slightly extended in §§ 2-4; it is simply a mathematical fact that, al- 
most any theory of probability having been adopted and the sure-thing 
principle having been suitably extended, the existence of a function 
whose expected value controls choices can be deduced. That does not 
mean that as a theory of actual economic behavior the theory of utility 
is absolutely established and cannot be overthrown. Quite the con- 
trary, it is a theory that makes factual predictions many of which can 
easily be observed to be false, but the theory may have some value in 
making economic predictions in certain contexts where the departures 
from it happen not to be devastating. Moreover, as I have been argu- 
ing, it may have value as a normative theory. 

Von Neumann and Morgenstern initiated among economists and, to 
a lesser extent, also among statisticians an intense revival of interest 
in the technical utility concept by their treatment of utility, which ap- 
pears as a digression in [V4]. 

The von Neumann-Morgenstern theory of utility has produced this 
reaction, because it gives strong intuitive grounds for accepting the 
Bernoullian utility hypothesis as a consequence of well-accepted maxims 
of behavior. To give readers of this book some idea of the von Neu- 
mann-Morgenstern theory, I may repeat that the treatment of utility 
as applied to gambles presented in §3 is virtually copied from their 
book [V4]. Indeed, their ideas on this subject are responsible for almost 
all of my own. One idea now held by me that I think von Neumann 
and Morgenstern do not explicitly support, and that so far as I know 
they might not wish to have attributed to them, is the normative in- 
terpretation of the theory. 

Of course, much of the new interest in utility takes the form of criti- 
cism and controversy. The greater part of this discussion that has come 
to my attention has not yet been published. A list of references lead- 
ing to most of that which has is [B7], [W14], [S1], [C4], [F13], [A2]. 

I shall successively discuss each of the recent major criticisms of the 
modern theory of utility known to me. My method in each case will 
be first to state the criticism in a form resembling those in which it is 
typically put forward, regardless of whether I consider that form well 
chosen. I will then discuss the criticism, elaborating its meaning and 
indicating its rebuttal, when there seems to me to be one. 


(a) Modern economic theorists have rigorously shown that there is 
no meaningful measure of utility. More specifically, if any function U 
fulfills the role of a utility, then so does any strictly monotonically in- 
creasing function of U. It must, therefore, be an error to conclude that 
every utility is a linear function of every other. 


This argument has been advanced with a seriousness that is surpris- 
ing, considering that it concedes little intelligence or learning to the 
proponents of the utility theory under discussion and considering that 
it results, as will immediately be explained, from the baldest sort of a 
terminological confusion. To be fair, I must go on to say that I have 
never known the argument to be defended long in the presence of the 
explanation I am about to give. 

In ordinary economic usage, especially prior to the work of von Neu- 
mann and Morgenstern, a utility associated with gambles would pre- 
sumably be simply a function U associating numbers with gambles in 
such a way that f ≤ g, if and only if U(f) ≤ U(g); though economic 
discussion of utility was, prior to von Neumann and Morgenstern, al- 
most exclusively confined to consequences rather than to gambles or 
to acts. It is unequivocally true, as I have already brought out, that 
any monotonic function of a utility in this wide classical sense is itself 
a utility. What von Neumann and Morgenstern have shown, and 
what has been recapitulated in § 3, is that, granting certain hypotheses, 
there exists at least one classical utility V satisfying the very special 
condition 


(2)    V(αf + βg) = αV(f) + βV(g), 


where f and g are any gambles and α, β are non-negative numbers such 
that α + β = 1. Furthermore, if I may for the moment call a classical 
utility satisfying (2) a von Neumann-Morgenstern utility, every von 
Neumann-Morgenstern utility is an increasing linear function of every 
other. To put the point differently, the essential conclusion of the von 
Neumann-Morgenstern utility theory is that (2) can be satisfied by a 
classical utility, but not by very many. The confusion arises only be- 
cause von Neumann and Morgenstern use the already pre-empted word 
“utility” for what I here call “von Neumann-Morgenstern utility.” 
In retrospect, that seems to have been a mistake in tactics, but one of 
no long-range importance. 
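
The distinction between a classical utility and one satisfying condition (2) can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the three consequences and their utilities are hypothetical, gambles are represented as dictionaries of probabilities, and the names `mix`, `affine`, `cubed`, and `satisfies_condition_2` are introduced here. It compares an increasing linear function of V with a merely monotonic one.

```python
import math

# Hypothetical utilities of three consequences (illustrative numbers only).
V_CONSEQUENCE = {"a": 0.0, "b": 1.0, "c": 4.0}

def mix(alpha, f, g):
    """Return the gamble alpha*f + (1 - alpha)*g as a dict of probabilities."""
    outcomes = set(f) | set(g)
    return {c: alpha * f.get(c, 0.0) + (1 - alpha) * g.get(c, 0.0) for c in outcomes}

def V(gamble):
    """Expected utility of a gamble; linear in mixtures by construction."""
    return sum(p * V_CONSEQUENCE[c] for c, p in gamble.items())

def affine(gamble):
    """An increasing linear function of V."""
    return 2.0 * V(gamble) + 3.0

def cubed(gamble):
    """A strictly increasing but non-linear function of V."""
    return V(gamble) ** 3

def satisfies_condition_2(U, f, g, alpha):
    """Check U(alpha*f + beta*g) == alpha*U(f) + beta*U(g) with beta = 1 - alpha."""
    return math.isclose(U(mix(alpha, f, g)), alpha * U(f) + (1 - alpha) * U(g))

f = {"b": 1.0}  # consequence b for sure
g = {"c": 1.0}  # consequence c for sure

for name, U in [("V", V), ("affine", affine), ("cubed", cubed)]:
    print(name, satisfies_condition_2(U, f, g, 0.5))
```

All three functions rank gambles in the same order, so all three are utilities in the wide classical sense. In this instance only V and its increasing linear transform pass the check for condition (2); the cube of V fails it.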


(b) The postulates leading to the von Neumann-Morgenstern concept 
of utility are arbitrary and gratuitous. 


Such a view can, of course, always be held without the slightest fear 
of rigorous refutation, but a critic holding it might perhaps be persuaded 
away from it by a reformulation of the postulates that he might find 
more appealing than the original set, or by illuminating examples. In 
particular, P1-7 are quite different from, but imply, the postulates of 
von Neumann and Morgenstern. Incidentally, the main function of 
the von Neumann-Morgenstern postulates themselves is to put the es- 
sential content of Daniel Bernoulli’s “postulate” into a form that is 
less gratuitous in appearance. At least one serious critic, who had at 
first found the system of von Neumann and Morgenstern gratuitous, 
changed his mind when the possibility of deriving certain aspects of 
that system from the sure-thing principle was pointed out to him. 


(c) The sure-thing principle goes too far. For example, if two lot- 
teries with cash prizes (not necessarily positive) are based on the same 
set of lottery tickets and so arranged that the prize that will be assigned 
to any ticket by the second lottery is at least as great as the prize as- 
signed to that ticket by the first lottery, then there is no doubt that 
virtually any person would find a ticket in the first lottery not prefer- 
able to the same ticket in the second lottery. If, however, the prizes 
in each lottery are themselves lottery tickets, such that the prize asso- 
ciated with any ticket in the first lottery is not preferred by the person 
under study to the prize associated with the same ticket by the second 
lottery, the conclusion that the person will not prefer a ticket in the 
first lottery to the same ticket in the second is no longer compelling. 


This point resembles the preceding one in that the intuitive appeal 
of an assumption can at most be indicated, not proved. I do think it 
cogent, however, to stress in connection with this particular point that 
a cash prize is to a large extent a lottery ticket in that the uncertainty 
as to what will become of a person if he has a gift of a thousand dollars 
is not in principle different from the uncertainty about what will be- 
come of him if he holds a lottery ticket of considerable actuarial value. 

Perhaps an adherent to the criticism in question would think it rele- 
vant to reply thus: Though cash sums are indeed essentially lottery 
tickets, a sum of money is worth at least as much to a person as a smaller 
sum, in a peculiarly definite and objective sense, because money can, 
if one desires, always be quickly and quietly thrown away, thereby 
making any sum available to a person who already has a larger sum. 
But I have never heard that reply made, nor do I here plead its cogency. 


(d) An actual systematic deviation from the sure-thing principle and, 
with it, from the von Neumann-Morgenstern theory of utility, can be 
exhibited. For example, a person might perfectly reasonably prefer to 
subsist on a packet of Army K rations per meal than on two ounces of 
the best caviar per meal. It is then to be expected, according to the 
sure-thing principle, that the person would prefer the K rations to a 
lottery ticket yielding the K rations with probability 9/10 and the 
caviar diet with probability 1/10. That expectation is no doubt ful- 
filled, if the lottery is understood to determine the person’s year-long 
diet once and for all. But, if the person is able to have at each meal a 
lottery ticket offering him the K rations or the caviar with the indicated 
probabilities, it is not at all unlikely, granting that he likes caviar and 
has some storage facilities, that he will prefer this “lottery diet.” This 
conclusion is in defiance of the principle that “the theory of consumer 
demand is a static theory.” (Cf. [W14].) 


I admit that the theory of utility is not static in the indicated sense, 
as the foregoing example conclusively shows. But there is not the 
slightest reason to think of a lottery producing either a steady diet of 
caviar or a steady diet of K rations as being the same lottery as one 
having a multitude of different prizes almost all of which are mixed 
chronological programs of caviar and K rations. The fact that a theory 
of consumer behavior in riskless situations happens to be static in the 
required sense (under certain special assumptions about storability and 
the linearity of prices) is no argument at all that the theory of consumer 
behavior in risky circumstances should be static in the same sense (as 
I mention in a note appended to [W14]). 


(e) If the von Neumann-Morgenstern theory of utility is not static, 
it is not subject to repeated empirical observation and is therefore 
vacuous. (Cf. [W14].) 


I think the discussion in § 3.1 of how to determine the preferences of 
a hot man for a swim, a shower, and a glass of beer, and the discussion 
in §5 of the practicality of identifying pseudo-microcosms are steps 
toward showing how the theory can be put to empirical test without 
making repeated trials on any one person. 


(f) Casual observation shows that real people frequently and fla- 
grantly behave in disaccord with the utility theory, and that in fact be- 
havior of that sort is not at all typically considered abnormal or ir- 
rational. 


Two different topics call for discussion under this heading. In the 
first place, it is undoubtedly true that the behavior of people does often 
flagrantly depart from the theory. None the less, all the world knows 
from the lessons of modern physics that a theory is not to be altogether 
rejected because it is not absolutely true. It seems not unreasonable to 
suppose, and examples could easily be cited to confirm, that in the ex- 
tremely complicated subject of the behavior of people very crude theory 
can play a useful role in certain contexts. 

Second, many apparent exceptions to the theory can be so reinter- 
preted as not to be exceptions at all. For example, a flier may be ob- 
served doing a stunt that risks his life, apparently for nothing. That 
seems to be in complete violation of the theory; but, if in addition it is 
known that the flier has a real and practical need to convince certain 
colleagues of his courage, then he is simply paying for advertising with 
the risk of his life, which is not in itself in contradiction to the theory. 
Or, suppose that it were known more or less objectively that the flier 
has a need to demonstrate his own courage to himself. The theory 
would again be rescued, but this time perhaps not so convincingly as 
before. In general, the reinterpretation needed to reconcile various 
sorts of behavior with the utility theory is sometimes quite acceptable 
and sometimes so strained as to lay whoever proposes it open to the 
charge of trying to save the theory by rendering it tautological. The 
same sort of thing arises in connection with many theories, and I think 
there is general agreement that no hard-and-fast rule can be laid down 
as to when it becomes inappropriate to make the necessary reinterpre- 
tation. For example, the law of the conservation of energy (or its 
atomic age variant, the law of the conservation of mass and energy) 
owes its success largely to its being an expression of remarkable and 
reliable facts of nature, but to some extent also to certain conventions 
by which new sorts of energy are so defined as to keep the law true. 
A stimulating discussion of this delicate point in connection with the 
theory of utility is given by Samuelson in [S1]. 


(g) Introspection about certain hypothetical decision situations sug- 
gests that the sure-thing principle and, with it, the theory of utility 
are normatively unsatisfactory. Consider an example based on two decision 
situations each involving two gambles.† 


Situation 1. Choose between 


Gamble 1. $500,000 with probability 1; and 

Gamble 2. $2,500,000 with probability 0.1, 
$500,000 with probability 0.89, 
status quo with probability 0.01. 


† This particular example is due to Allais [A2]. Another interesting example was 
presented somewhat earlier by Georges Morlat [C4]. 


Situation 2. Choose between 


Gamble 3. $500,000 with probability 0.11, 
status quo with probability 0.89; and 

Gamble 4. $2,500,000 with probability 0.1, 
status quo with probability 0.9. 


Many people prefer Gamble 1 to Gamble 2, because, speaking quali- 
tatively, they do not find the chance of winning a very large fortune in 
place of receiving a large fortune outright adequate compensation for 
even a small risk of being left in the status quo. Many of the same 
people prefer Gamble 4 to Gamble 3; because, speaking qualitatively, 
the chance of winning is nearly the same in both gambles, so the one 
with the much larger prize seems preferable. But the intuitively 
acceptable pair of preferences, Gamble 1 preferred to Gamble 2 and 
Gamble 4 to Gamble 3, is not compatible with the utility concept or, 
equivalently, the sure-thing principle. Indeed that pair of preferences implies 
the following inequalities for any hypothetical utility function. 


(3)    U($500,000) > 0.1U($2,500,000) + 0.89U($500,000) + 0.01U($0), 

       0.1U($2,500,000) + 0.9U($0) > 0.11U($500,000) + 0.89U($0); 


and these are obviously incompatible. 
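
The incompatibility can be spelled out in two lines. In the sketch below, the abbreviations u_0, u_5, and u_25 (introduced here, not in the text) stand for U($0), U($500,000), and U($2,500,000); the first line is the preference for Gamble 1 over Gamble 2, the second the preference for Gamble 4 over Gamble 3, each read off the definitions of the gambles.

```latex
\begin{align*}
u_5 > 0.10\,u_{25} + 0.89\,u_5 + 0.01\,u_0
  &\iff 0.11\,u_5 > 0.10\,u_{25} + 0.01\,u_0, \\
0.10\,u_{25} + 0.90\,u_0 > 0.11\,u_5 + 0.89\,u_0
  &\iff 0.10\,u_{25} + 0.01\,u_0 > 0.11\,u_5.
\end{align*}
```

The first line follows on subtracting 0.89 u_5 from both sides, the second on subtracting 0.89 u_0. The two right-hand statements assert opposite strict inequalities between the same pair of quantities, so no assignment of utilities can satisfy both.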

Examples† like the one cited do have a strong intuitive appeal; even 
if you do not personally feel a tendency to prefer Gamble 1 to Gamble 2 
and simultaneously Gamble 4 to Gamble 3, I think that a few trials 
with other prizes and probabilities will provide you with an example 
appropriate to yourself. 

If, after thorough deliberation, anyone maintains a pair of distinct 
preferences that are in conflict with the sure-thing principle, he must 
abandon, or modify, the principle; for that kind of discrepancy seems 
intolerable in a normative theory. Analogous circumstances forced 
D. Bernoulli to abandon the theory of mathematical expectation for 
that of utility [B10]. In general, a person who has tentatively accepted 
a normative theory must conscientiously study situations in which the 
theory seems to lead him astray; he must decide for each by reflection 
—deduction will typically be of little relevance—whether to retain his 
initial impression of the situation or to accept the implications of the 
theory for it. 

To illustrate, let me record my own reactions to the example with 


† Allais has announced (but not yet published) an empirical investigation of the 
responses of prudent, educated people to such examples [A2]. 
which this heading was introduced. When the two situations were 
first presented, I immediately expressed preference for Gamble 1 as 
opposed to Gamble 2 and for Gamble 4 as opposed to Gamble 3, and I 
still feel an intuitive attraction to those preferences. But I have since 
accepted the following way of looking at the two situations, which 
amounts to repeated use of the sure-thing principle. 

One way in which Gambles 1-4 could be realized is by a lottery with 
a hundred numbered tickets and with prizes according to the schedule 
shown in Table 1. 


TABLE 1. PRIZES IN UNITS OF $100,000 IN A LOTTERY REALIZING 
GAMBLES 1-4 


                                Ticket Number 
                              1     2-11    12-100 

Situation 1    Gamble 1       5       5        5 
               Gamble 2       0      25        5 

Situation 2    Gamble 3       5       5        0 
               Gamble 4       0      25        0 


Now, if one of the tickets numbered from 12 through 100 is drawn, it 
will not matter, in either situation, which gamble I choose. I therefore 
focus on the possibility that one of the tickets numbered from 1 through 
11 will be drawn, in which case Situations 1 and 2 are exactly parallel. 
The subsidiary decision depends in both situations on whether I would 
sell an outright gift of $500,000 for a 10-to-1 chance to win $2,500,000— 
a conclusion that I think has a claim to universality, or objectivity. 
Finally, consulting my purely personal taste, I find that I would prefer 
the gift of $500,000 and, accordingly, that I prefer Gamble 1 to Gamble 
2 and (contrary to my initial reaction) Gamble 3 to Gamble 4. 
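
The ticket argument can be checked mechanically. The following sketch assumes a prize schedule read off the definitions of Gambles 1-4 (status quo on ticket 1, the large prize on tickets 2-11, and so on); the names `SCHEDULE`, `prize`, and `distribution` are introduced here for illustration only.

```python
from collections import Counter
from fractions import Fraction

# Prizes in units of $100,000 for ticket blocks (1, 2-11, 12-100),
# read off the definitions of Gambles 1-4.
SCHEDULE = {
    1: (5, 5, 5),
    2: (0, 25, 5),
    3: (5, 5, 0),
    4: (0, 25, 0),
}

def prize(gamble, ticket):
    """Prize paid by the given gamble when the given ticket (1..100) is drawn."""
    if ticket == 1:
        block = 0
    elif ticket <= 11:
        block = 1
    else:
        block = 2
    return SCHEDULE[gamble][block]

def distribution(gamble):
    """Probability of each prize when one of 100 tickets is drawn at random."""
    counts = Counter(prize(gamble, t) for t in range(1, 101))
    return {x: Fraction(n, 100) for x, n in counts.items()}

for gamble in (1, 2, 3, 4):
    print(gamble, distribution(gamble))

# On tickets 12-100 the choice within each situation makes no difference.
same_high = all(prize(1, t) == prize(2, t) and prize(3, t) == prize(4, t)
                for t in range(12, 101))
# On tickets 1-11 the two situations present exactly the same pair of prizes.
parallel_low = all(prize(1, t) == prize(3, t) and prize(2, t) == prize(4, t)
                   for t in range(1, 12))
print(same_high, parallel_low)
```

The computed distributions reproduce the probabilities stated for the four gambles, and the two final checks correspond to the two steps of the argument: tickets 12 through 100 are irrelevant to the choice, and on tickets 1 through 11 the two situations coincide.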

It seems to me that in reversing my preference between Gambles 3 
and 4 I have corrected an error. There is, of course, an important sense 
in which preferences, being entirely subjective, cannot be in error; but 
in a different, more subtle sense they can be. Let me illustrate by a 
simple example containing no reference to uncertainty. A man buying 
a car for $2,134.56 is tempted to order it with a radio installed, which 
will bring the total price to $2,228.41, feeling that the difference is 
trifling. But, when he reflects that, if he already had the car, he cer- 
tainly would not spend $93.85 for a radio for it, he realizes that he has 
made an error. 


One thing that should be mentioned before this chapter is closed is 
that the law of diminishing marginal utility plays no fundamental role 
in the von Neumann-Morgenstern theory of utility, viewed either 
empirically or normatively. Therefore the possibility is left open that 
utility as a function of wealth may not be concave, at least in some in- 
tervals of wealth. Some economic-theoretical consequences of recog- 
nition of the possibility of non-concave segments of the utility function 
have been worked out by Friedman and myself [F12], and by Friedman 
alone [F11]. The work of Friedman and myself on this point is 
criticized by Markowitz [M1].+ 


+ See also Archibald (1959) and Hakansson (1970). 


CHAPTER 6 


Observation 


1 Introduction 


With the construction of utility, the theory of decision in the face 
of uncertainty is, in a sense, complete. I have no further postulates 
to propose, and those I have proposed have been shown to be equiva- 
lent to the assumption that the person always decides in favor of an 
act the expected utility of which is as large as possible, supposing for 
simplicity that only a finite number of acts are open to him. At the 
level of generality that has led to this conclusion there seems to be 
little or nothing left to say. To go further now means to go into more 
detail, to investigate special types of decision problems. One type of 
decision problem of central importance is that in which the person is 
called upon to make an observation and then to choose some act in the 
light of the outcome of the observation. 

The consideration of such observational decision problems is a step 
toward those problems of great interest for statistics in which the per- 
son must decide what observation to make, that is, of course, what to 
look at, not what to see. They are the problems of designing experi- 
ments and other observational programs. 

Some remarks on observation were made in Chapter 3, but only now 
that the theory of utility is established is it possible to give a relatively 
complete analysis of the concept. 

Observation is a concept essential to the study of statistics proper, 
most of what has been said thus far being preliminary to, but not really 
part of, statistics; even after this chapter and the next one, on obser- 
vation, there will still remain a major transition. One important fea- 
ture of much of what is ordinarily called statistics is, according to 
my analysis, concerned with the behavior not of an isolated person, but 
of a group of persons acting, for example, in concert. In later chapters 
I will deal, so far as I am able, with the problem of group action, but 
preliminary considerations bearing on it will be made and pointed out 
from time to time in this chapter and the next. 

Though the details of these two chapters may seem mathematically 
forbidding, drastic simplifying assumptions are made in them to keep 
extraneous difficulties to a minimum. These typically take the form 
of assuming that certain sets of acts, events, and values of random varia- 
bles are finite. Even in elementary applications of the theory, these 
simplifying assumptions seldom actually hold. In some contexts, it is 
quite elementary to relax them sufficiently; in others, serious mathe- 
matical effort has been required; and some are still at the frontier of 
research. Relaxations of the assumptions will be touched on from time 
to time, sometimes explicitly but sometimes only implicitly in the choice 
of suggestive notation and nomenclature. 

Beyond this introduction, the present chapter is divided into four 
sections: § 2 analyzes informally and then formally the notion of a cost- 
free observation; §§ 3 and 4 discuss certain obvious but important con- 
ditions under which one observation, and similarly one set of acts, is 
more valuable than another; §5 abstractly discusses problems of de- 
signing experiments or, perhaps more generally, observational programs. 


2 What an observation is 


To begin with an informal survey of observation, consider a decision 
problem, that is, a person faced with a decision among several acts. 
Calling it the basic decision problem and the acts associated with it 
the basic acts, a new decision problem would arise, if the person were 
informed before he made his decision that a particular event, say B, 
obtained. The new decision problem is related to the basic decision 
problem in a simple way; for the acts associated with it are also the 
basic acts, and the decision is to be made by computing the expected 
utility given B of the basic acts and deciding on one that maximizes 
the conditional expected utility. The basic problem may be modified 
in still another, though closely related, way. Let the person say in 
advance, for each possible B_i, which of the basic acts he will decide on 
when he is informed, as he is to be, which element B_i of a given 
partition obtains. This will be called the derived decision problem arising 
from the basic decision problem and the observation of i, and its acts 
will be called derived acts. Technically speaking, the derived acts are 
determined by arbitrarily assigning one basic act to each element of 
the partition. For any state s, the consequence of a derived act is the 
consequence for s of the basic act associated with the particular B_i in 
which s lies. The terms informally introduced in this paragraph are 
defined formally later in the section. 

A derived decision problem is not necessarily different in kind from 
the basic problem; indeed it is quite possible that the basic problem can 
itself be viewed as derived from some other basic problem and obser- 
vation. 

Formidable though the description of a derived problem may seem 
at first reading, its solution is, in a sense, easy and has already almost 
been given; for it is clear that, if P(B_i) > 0, the person will decide to 
associate with B_i a basic act the expected utility of which given B_i is 
as high as possible, and, if P(B_i) = 0, it is immaterial to the person 
which basic act is associated with B_i. 

It is almost obvious that the value of a derived problem cannot be 
less, and typically is greater, than the value of the basic problem from 
which it is derived. After all, any basic act is among the derived acts, 
so that any expected utility that can be attained by deciding on a basic 
act can be attained by deciding on the same basic act considered as a 
derived act. In short, the person is free to ignore the observation. 
That obvious fact is the theory’s expression of the commonplace that 
knowledge is not disadvantageous. 
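
A small numerical sketch may make the comparison concrete. Everything in it is hypothetical and chosen only for illustration: four equally probable states, two basic acts named "f" and "g", and a two-element partition. The value of the basic problem is the largest expected utility among basic acts; the value of the derived problem is obtained by choosing, on each element of the partition, the basic act that does best there.

```python
from fractions import Fraction as F

# Hypothetical probabilities of four states (illustrative only).
P = {"s1": F(1, 4), "s2": F(1, 4), "s3": F(1, 4), "s4": F(1, 4)}

# Hypothetical utilities of two basic acts in each state.
ACTS = {
    "f": {"s1": 4, "s2": 2, "s3": -3, "s4": -1},
    "g": {"s1": 0, "s2": 0, "s3": 0, "s4": 0},
}

# The partition whose element the person is to be told.
PARTITION = [{"s1", "s2"}, {"s3", "s4"}]

def partial_expectation(act, event):
    """Sum of P(s) * utility over the states in the event."""
    return sum(P[s] * ACTS[act][s] for s in event)

def basic_value():
    """Best expected utility attainable with a single basic act."""
    return max(partial_expectation(a, P.keys()) for a in ACTS)

def derived_value():
    """Best expected utility when a basic act may be chosen per element."""
    return sum(max(partial_expectation(a, B) for a in ACTS) for B in PARTITION)

print(basic_value(), derived_value())
```

Maximizing the partial expectation on an element of positive probability is the same as maximizing the conditional expected utility given that element, and an element of probability zero contributes nothing. With these particular numbers the derived problem is worth 3/2 against 1/2 for the basic problem; in general the derived value can never be smaller, since using the same basic act on every element is always permitted.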

It sometimes happens that a real person avoids finding something 
out or that his friends feel duty bound to keep something from him, 
saying that what he doesn’t know can’t hurt him; the jealous spouse 
and the hypochondriac are familiar tragic examples. Such apparent 
exceptions to the principle that forewarned is forearmed call for anal- 
ysis. At first sight, one might be inclined to say that the person who 
refuses freely proffered information is behaving irrationally and in vio- 
lation of the postulates. But perhaps it is better to admit that informa- 
tion that seems free may prove expensive by doing psychological harm 
to its recipient. Consider, for example, a sick person who is certain 
that he has the best of medical care and is in a position to find out 
whether his sickness is mortal. He may decide that his own personality 
is such that, though he can continue with some cheer to live in the 
fear that he may possibly die soon, what is left of his life would be 
agony, if he knew that death were imminent. Under such circumstances, 
far from calling him irrational, we might extol the person’s rationality, 
if he abstained from the information. On the other hand, such an 
interpretation may seem forced. (Cf. Criticism (f) of § 5.6.) 

Examples of decisions based on observation are on every hand, but 
it will be worth while to examine one in some detail before undertaking 
an abstract mathematical analysis of such decisions. Any example 
would have to be highly idealized for simplicity, because the complexity 
of any real decision problem defies complete explicit description, but 
particular simplicity is in order here. 

The person in the example is considering whether to buy some of the 
grapes he sees in a grocery store and, if so, in what quantity. To his 
taste, the grapes may be of any of three qualities, poor, fair, and excel- 
lent. Call the qualities Q generically and 1, 2, and 3 individually. From 
what the person knows at the moment, including of course the appearance
of the grapes, he cannot be certain of their quality, but he attaches
personal probability to each of the three possibilities according to 
Table 1. 

TABLE 1. P(Q)


Q(uality)        1      2      3
P(robability)   1/4    1/2    1/4


The person can decide to buy 0, 1, 2, or 3 pounds of grapes; these 
are the basic acts of the example. Taking one consideration with an- 
other, he finds the consequences of each act, measured in utiles, in 
each of the three possible events to be those given in the body of Table 
2. The expected utilities in the right margin of Table 2 follow, of 
course, from Table 1 and the body of Table 2. 


TABLE 2. UTILITY f(Q) FOR EACH f AND EACH Q


              Q
  f      1     2     3   |  E(f)
  0      0     0     0   |    0
  1     -1     1     3   |    1
  2     -3     0     5   |   1/2
  3     -6    -2     6   |   -1


The entries in Table 2 have not been chosen haphazardly, but with 
an attempt at verisimilitude. Thus it is supposed that if the person 
buys grapes of poor quality his dissatisfaction with the bargain will 
accelerate rapidly with the amount bought, which seems reasonable, 
especially if the keeping quality of poor grapes is low. He is, of course, 
unaffected by the quality if he buys none. Again, buying a few fair 
grapes may be mildly desirable, but overbuying is not. Finally, excel- 
lent grapes are worth buying, even in large quantities, but the utility 
of the purchase increases less than proportionally to the amount bought. 

The correct solution of the basic decision problem is to buy 1 pound 
of grapes; for that act has, according to the right margin of Table 2, 
an expected utility of 1, which is the largest that can be attained. 
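
The right margin of Table 2 can be checked mechanically. The following Python sketch is an illustration only and no part of the theory; the names P_Q, UTILITY, and expected_utility are invented for it, and the utilities are the entries of Table 2 as they are implied by its right margin together with Table 1.

```python
from fractions import Fraction as Fr

# Table 1: personal probability of each quality Q = 1, 2, 3.
P_Q = {1: Fr(1, 4), 2: Fr(1, 2), 3: Fr(1, 4)}

# Body of Table 2: UTILITY[f][Q] is the utility, in utiles, of buying
# f pounds of grapes when their quality is Q.
UTILITY = {
    0: {1: 0, 2: 0, 3: 0},
    1: {1: -1, 2: 1, 3: 3},
    2: {1: -3, 2: 0, 3: 5},
    3: {1: -6, 2: -2, 3: 6},
}


def expected_utility(f):
    """Return E(f): the utilities of act f weighted by Table 1."""
    return sum(P_Q[q] * UTILITY[f][q] for q in P_Q)


expected = {f: expected_utility(f) for f in UTILITY}
best_act = max(expected, key=expected.get)  # act with the largest E(f)

for f, e in expected.items():
    print(f"buy {f} lb: E(f) = {e}")
print("best basic act: buy", best_act, "lb")
```

Exact fractions are used so that the margins are reproduced without rounding.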

Now, suppose the person is free to make an observation, that is, a
new observation in addition to those that may have contributed to the 
determination of the probabilities in the basic problem. It may be, for 
example, that the grocer invites him to eat a few of the grapes or that 
the person is going to ask the woman beside him how they look to her. 
Let there be five possible outcomes of his observation; call them x
generically and 1, 2, 3, 4, and 5 individually. I assume, though this
feature is rather incidental to the example, that low values of x tend 
to be suggestive of low quality. The joint distribution of x and Q, that 
is, the probability that x and Q simultaneously have any given pair of 
values, is of central technical importance. Those probabilities, each 
multiplied by 128 for simplicity of presentation, are given in the body 
of Table 3. The right-hand and bottom margins of the table give, 


TABLE 3. 128 P(x ∩ Q)


              Q
  x      1     2     3   |  128 P(x)
  1     15     5     1   |     21
  2     10    15     2   |     27
  3      4    24     4   |     32
  4      2    15    10   |     27
  5      1     5    15   |     21
 ------------------------------------
        32    64    32   |    128
      128 P(Q)


also multiplied by 128, the probability of each value of x and of each 
value of Q. The marginal entries are, of course, obtained by adding 
rows and columns. As indicated in the lower right-hand corner of the 
table, the probabilities assumed do indeed add up to 1, and the bottom 
margin recapitulates Table 1. 

Conditional probabilities can easily be read from Table 3. Thus, for 
example, the conditional probability that x is 2, given that Q is 3, is 
2/32, and the conditional probability that Q is 2, given that x is 4, is 
15/27. It will be seen in later sections that the distribution of x given 
Q is, in a sense, even more fundamental than the joint distribution of
x and Q. 
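
Reading conditional probabilities from Table 3 amounts to dividing an entry by the margin of its column or of its row. A minimal Python sketch of that reading follows; the function names are invented for the illustration.

```python
from fractions import Fraction as Fr

# Body of Table 3: JOINT[x][Q] = 128 * P(x and Q).
JOINT = {
    1: {1: 15, 2: 5, 3: 1},
    2: {1: 10, 2: 15, 3: 2},
    3: {1: 4, 2: 24, 3: 4},
    4: {1: 2, 2: 15, 3: 10},
    5: {1: 1, 2: 5, 3: 15},
}


def margin_x(x):
    """128 * P(x): the sum of row x."""
    return sum(JOINT[x].values())


def margin_q(q):
    """128 * P(Q): the sum of column Q."""
    return sum(row[q] for row in JOINT.values())


def p_x_given_q(x, q):
    """Conditional probability that the observation is x, given quality Q."""
    return Fr(JOINT[x][q], margin_q(q))


def p_q_given_x(q, x):
    """Conditional probability that the quality is Q, given observation x."""
    return Fr(JOINT[x][q], margin_x(x))


print(p_x_given_q(2, 3), p_q_given_x(2, 4))
```

The fractions are printed in lowest terms, so 2/32 and 15/27 appear reduced.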

There are 4^5 = 1,024 derived acts, since one of the four basic acts
can be assigned arbitrarily to each of the five possible outcomes of the 
observation. It is an easy exercise, using Tables 2 and 3, to verify 
Table 4, which shows the conditional expectation of the utility of each 


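TABLE 4. E(f | x)


                        x
  f       1         2         3         4         5
  0     *0/21*     0/27      0/32      0/27      0/21
  1     -7/21    *11/27*   *32/32*    43/27     49/21
  2    -40/21    -20/27      8/32    *44/27*    72/21
  3    -94/21    -78/27    -48/32     18/27    *74/21*

(The italicized entries, the highest in each column, are set between asterisks.)
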
basic act given each possible outcome of the observation. For each x,
the highest expected utility, given that value of x, has been italicized. 
Thus, for example, only if x is 1 will the person refrain from buying 
grapes altogether, and only if x is 5 will he risk buying 3 pounds. In 
full, the best derived act, call it g, is to buy 0, 1, 1, 2, or 3 pounds, if x 
is 1, 2, 3, 4, or 5, respectively. The value of the derived problem is the 
expected value of g, which is computed thus: 


(1)     $E(g) = \sum_x E(g \mid x) P(x)$

        $= (0 + 11 + 32 + 44 + 74)/128$

        $= 161/128 \approx 1.26$ utiles.


Since the value of the basic problem is 1 utile, the envisaged observa- 
tion is worth 0.26 utile; that is, the person would if necessary pay up 
to 0.26 utile for the observation. 
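
The verification of Table 4 and of (1) can be sketched in Python. The code is illustrative only; its names are invented, and its inputs are the bodies of Tables 2 and 3.

```python
from fractions import Fraction as Fr

# Body of Table 3: JOINT[x][Q] = 128 * P(x and Q).
JOINT = {
    1: {1: 15, 2: 5, 3: 1},
    2: {1: 10, 2: 15, 3: 2},
    3: {1: 4, 2: 24, 3: 4},
    4: {1: 2, 2: 15, 3: 10},
    5: {1: 1, 2: 5, 3: 15},
}
# Body of Table 2: UTILITY[f][Q].
UTILITY = {
    0: {1: 0, 2: 0, 3: 0},
    1: {1: -1, 2: 1, 3: 3},
    2: {1: -3, 2: 0, 3: 5},
    3: {1: -6, 2: -2, 3: 6},
}


def cond_expectation(f, x):
    """E(f | x), from row x of Table 3 and row f of Table 2."""
    row = JOINT[x]
    return Fr(sum(row[q] * UTILITY[f][q] for q in row), sum(row.values()))


def expectation(f):
    """E(f), the unconditional expected utility of basic act f."""
    return sum(Fr(JOINT[x][q], 128) * UTILITY[f][q]
               for x in JOINT for q in JOINT[x])


def best_act_given(x):
    """The basic act with the highest expected utility given x."""
    return max(UTILITY, key=lambda f: cond_expectation(f, x))


g = {x: best_act_given(x) for x in JOINT}  # the best derived act
value_derived = sum(
    Fr(sum(JOINT[x].values()), 128) * cond_expectation(g[x], x) for x in JOINT
)
value_basic = max(expectation(f) for f in UTILITY)
value_of_observation = value_derived - value_basic

print(g, value_derived, value_of_observation)
```

The difference of the two values is the most the person would pay for the observation; in exact terms it is 33/128 utile, which the text rounds to 0.26.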


Exercise 


1. Suppose that the person could directly observe the quality of the 
grapes. Show that his best derived act would then yield 2 utiles, and 
show that it could not possibly lead him to buy 2 pounds of the grapes. 


The notion of a decision problem based on an observation will now be 
formally described, with special reference to mathematical notation and 
other technical details. 

1. There is a set of basic acts, F with elements f, f’, etc. 

In the example of the grapes F consisted of the four envisaged acts 
of buying 0, 1, 2, or 3 pounds of grapes. 

The convention laid down at the end of § 5.4, requiring that the con- 
sequences of acts be measured in utiles, will be adhered to, and it will 
be supposed that v(F) is finite. 

2. The observation is a (not necessarily real) random variable x 
associating with each state s an observed value x(s) in some set X of
possible observed values x, x’, etc. 

In the example of the grapes, the states s (of which the postulates 
require that there be an infinite number) were never fully described, 
and consequently the random variable x was not fully described either. 
In the same sense it may be said that the basic acts, which are also 
really random variables, were not fully described either. All that is 
really important, however, is to know the simultaneous distribution of 
the consequences of the acts in F and of the values of x. In the example 
of the grapes that information was implicit in Tables 2 and 3.

For mathematical simplicity in the formal work to follow, it will
generally be assumed that X has only a finite number of elements, 
though the assumption can and must be relaxed in many practical 
situations. When X is assumed finite, the random variable x is, for 
all purposes of the present context, simply a partition of S, namely, 
the partition into the sets on which x is constant. Indeed, earlier in 
this section, the notion of observation was described in terms of a par- 
tition, but the description in terms of a random variable is more familiar 
in statistics and may have technical advantages, especially when the 
restriction that X be finite is relaxed. 

3. The set of strategy functions is the set of all functions associating 
an element of F with each element x of X. Let the values of the generic 
strategy function be denoted by f(x) and the function itself by f(x). 

The notion of strategy function was not introduced in the informal 
description of observation, nor in the example of the grapes, because 
it is but a mathematical intermediary to the definition of derived acts 
and did not seem to call for explicit expression in the less formal con- 
texts. 

4. To each strategy function f(x) corresponds a derived act g, in the 
set of all derived acts F(x), defined by 


(2)     $g(s) = f(s; x(s))$ for all $s \in S$.


It was explained that in the example of the grapes there are 4^5 derived
acts. In the same way, it can be seen in general that if X has ξ
and F has φ elements there are φ^ξ derived acts.

5. The value of F given x,


(3)     $v(F \mid x) =_{\mathrm{Df}} \sup_{f \in F} E(f \mid x)$.


This is the function of x indicated, for the example of the grapes, 
by italics in Table 4. 
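
The five-part description above can be exercised on the example of the grapes. The Python sketch below, which is an illustration with invented names, enumerates every strategy function, evaluates the derived act corresponding to each, and compares the best of them with the expected value of v(F | x).

```python
from fractions import Fraction as Fr
from itertools import product

# Body of Table 3: JOINT[x][Q] = 128 * P(x and Q).
JOINT = {
    1: {1: 15, 2: 5, 3: 1},
    2: {1: 10, 2: 15, 3: 2},
    3: {1: 4, 2: 24, 3: 4},
    4: {1: 2, 2: 15, 3: 10},
    5: {1: 1, 2: 5, 3: 15},
}
# Body of Table 2: UTILITY[f][Q].
UTILITY = {
    0: {1: 0, 2: 0, 3: 0},
    1: {1: -1, 2: 1, 3: 3},
    2: {1: -3, 2: 0, 3: 5},
    3: {1: -6, 2: -2, 3: 6},
}
X = sorted(JOINT)    # possible observed values
F = sorted(UTILITY)  # basic acts


def value_of_strategy(strategy):
    """Expected utility of the derived act g(s) = f(s; x(s))."""
    return sum(Fr(JOINT[x][q], 128) * UTILITY[strategy[x]][q]
               for x in X for q in JOINT[x])


def value_given(x):
    """v(F | x): the supremum over f in F of E(f | x)."""
    weight = sum(JOINT[x].values())
    return max(Fr(sum(JOINT[x][q] * UTILITY[f][q] for q in JOINT[x]), weight)
               for f in F)


# One strategy function for each way of assigning a basic act to each x.
strategies = [dict(zip(X, acts)) for acts in product(F, repeat=len(X))]
v_derived = max(value_of_strategy(s) for s in strategies)
expected_v = sum(Fr(sum(JOINT[x].values()), 128) * value_given(x) for x in X)

print(len(strategies), v_derived, expected_v)
```

Exhaustive enumeration is feasible here only because X and F are small; the pointwise maximization in value_given reaches the same number without listing the derived acts.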


3 Multiple observations, and extensions of observations and of sets 
of acts 


If several random variables x_1, ..., x_n, associating elements of S
with elements of sets X_1, ..., X_n, are simultaneously under discussion,
it is natural to form the new random variable, denoted x = {x_1, ...,
x_n}, that associates with each element of S an ordered n-tuple of
elements of X_1, ..., X_n, respectively. If the context is such that x_1, ...,
x_n are thought of as observations, then x can also be thought of as an
observation and will sometimes be called a multiple observation, to
emphasize the manner of its formation. To illustrate, any item such
as profession or body temperature that might be entered on a patient’s 
history can be thought of as an observation; but the whole history, or 
a filing cabinet of histories, can also be thought of as an observation, 
the history being a multiple observation of items, and the cabinet a 
multiple observation of histories. 

Consider two observations x and y. It is an interesting possibility 
that x and y are so related to each other that knowledge of the value 
of x would (almost certainly) imply (almost certain) knowledge of y. 
In that case, observation of x implies essentially the observation of y 
and generally something besides, which suggests the following three 
definitions. 

If and only if x and y are observations such that, for all s and s’ in 
some B of probability one, x(s) = x(s’) implies y(s) = y(s’); then x is an 
extension of y, and y is a contraction of x. If x is an extension of y, 
and y is an extension of x, then x and y are equivalent. 
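
When S is finite, the definition can be tested directly, the set B being taken as the set of states of positive probability. The following Python sketch assumes such a finite S; the six-state world and all names in it are hypothetical and serve only as illustration.

```python
from fractions import Fraction as Fr


def is_extension(x, y, prob):
    """True if x(s) = x(s') implies y(s) = y(s') for all s, s' in B,
    where B is the set of states of positive probability."""
    y_value_for = {}
    for s, p in prob.items():
        if p == 0:
            continue  # states of probability zero may be left out of B
        # Record the y-value seen with this x-value; a clash refutes extension.
        if y_value_for.setdefault(x[s], y[s]) != y[s]:
            return False
    return True


def is_equivalent(x, y, prob):
    """True if each observation is an extension of the other."""
    return is_extension(x, y, prob) and is_extension(y, x, prob)


# A hypothetical world of six equally probable states.
prob = {s: Fr(1, 6) for s in range(6)}
x = {s: s for s in prob}              # observes the state itself
y = {s: s // 2 for s in prob}         # observes which pair holds the state
z = {s: s % 2 for s in prob}          # observes the parity of the state
yz = {s: (y[s], z[s]) for s in prob}  # the double observation {y, z}

print(is_extension(x, y, prob), is_extension(y, x, prob),
      is_equivalent(yz, x, prob))
```

Here x extends y but not conversely, neither of y and z extends the other, and the double observation {y, z} is equivalent to x.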

Strictly speaking, one should say not that x and y are equivalent, 
but rather that they are equivalent regarded as observations, for this 
would not be a good concept of equivalence to apply to random varia- 
bles regarded as such. For example, a pair of equivalent observations 
can obviously be a pair of real random variables with different expected 
values. Some properties of the relations of extension, contraction, and 
equivalence between observations are given by the following easy but 
important exercises. Throughout this set of exercises it is unnecessary 
to suppose the observations confined to a finite set of values; in the case 
of Exercise 3b, it is impossible to do so. 


Exercises 


1. x and y are equivalent, if and only if x is both an extension and a 
contraction of y. 

2a. If P{x(s) = y(s)} = 1, x and y are equivalent. 

2b. Any observation x is equivalent to itself. 

3a. If there is a value y_0 such that P{y(s) = y_0} = 1, then every
x is an extension of y, and any two such observations are equivalent.
Such an observation, of course, amounts to observing nothing at all 
and will therefore be called a null observation. 

3b. If x(s) = s for almost all s ∈ S, then x extends every y.

4. If x is an extension of y, and y is an extension of z, then x is an 
extension of z. State and verify the analogous fact about equivalence. 

5a. If y’ is a function associating an element of Y with each element 
of X, and x is an observation, then the observation y such that y = 
y’(x) is a contraction of x.

5b. If y is a contraction of x, then there is a function y’ such that
P{y(s) = y’(x(s))} = 1. What freedom is there in the choice of the
function y’? 

5c. What are the implications of Exercises 5a and 5b for equivalence 
between observations? 

6. If x and y are observations and z = {x, y} is the corresponding 
double observation, then z is an extension of x and of y. (This exercise 
seems to call for a converse saying that every extension can be regarded 
as a double observation, but no really neat one suggests itself to me. 
None the less, in thinking about extensions and contractions, the sort 
brought out by the exercise is a typical and stimulating example.) 

7. {x, y} is equivalent to x, if and only if x extends y. 


The relations of extension, contraction, and equivalence have paral- 
lels for sets of acts, defined thus: 

If F and G are (non-vacuous) sets of acts such that, for some B of 
probability one, there is for each g ∈ G an f ∈ F with f(s) = g(s) for all
s ∈ B; then F is an extension of G, and G is a contraction of F. If F is
an extension of G, and G is an extension of F, then F and G are equiv- 
alent. 


More exercises 


8. If F is an extension of (equivalent to) G, then v(F) ≥ (=) v(G).

9. Discuss the analogues of Exercises 1, 2b, and 4 for sets of 
acts. 

10. If F ⊃ G, then F extends G.

11. If F(x) is derived from F on observation of x, then F(x) extends 
F. 

12. HYP.


F(x) is derived from F on observation of x;

F(y) is derived from F on observation of y;

F(x, y) is derived from F on observation of {x, y};

F(x; y) is derived from F(x) on observation of y.


CONCL.


1. F(x, y) is equivalent to F(x; y).

2. F(x, y) extends F(x) and F(y). 

3. If x is equivalent to y, then F(x) is equivalent to F(y). 

4. If y extends x, then F(x, y) is equivalent to F(y), F(y) is equivalent
to F(x; y), and F(y) extends F(x).

13a. Under the hypothesis of 12, the equivalences and relations of
extension among the sets of acts arising out of two observations can,
with evident conventions, be diagrammed thus:


[Diagram not reproduced.]


13b. If y extends x, the diagram becomes


[Diagram not reproduced.]


13c. If x and y are equivalent, the diagram becomes


[Diagram not reproduced.]


14. If F(x) and G(x) are derived from F and G, respectively, and if 
F extends G, then F(x) extends G(x). 


15. $v(F(x)) = E[v(F \mid x)] = \int v(F \mid x(s)) \, dP(s) \geq v(F)$.


4 Dominance and admissibility 


According to Exercise 3.14, if one set of acts, regarded as basic, ex- 
tends another, the first is at least as valuable as the second in the light 
of any observation whatever. This section explores a relation, domi- 
nance, which has the same property but is not so strict as extension. 
Dominance is of some importance for the theory of personal probability 
as it has been developed thus far. But its importance will be even
greater in the study of statistics proper, where interpersonal agreement 
is of particular interest; for, as the definition shortly to be given will 
make clear, two people having different personal probabilities will agree 
as to whether one of two sets of acts dominates another, if only they 
agree which events have probability zero—a condition generally met 
in practice, and one that could if desired be dispensed with by a slight 
change in the definition of dominance. 

It will be seen that dominance and notions related to it are intimately 
associated with the sure-thing principle. Indeed, probability being 
taken for granted, the basic facts about dominance seem to give a com- 
plete expression of the sure-thing principle. Dominance and related 
concepts were much stressed by Wald, in [W3] for example. 
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Two or three notions, the logical connections among them, and those 
between them and extension, are to be treated. The logical connec- 
tions being many but simple, I think that the material lends itself bet- 
ter to formal than to expository treatment, for in such a context the 
reader who looks for the motivating ideas sees them himself more easily 
than he comprehends someone else’s verbalization of them. This sec- 
tion will therefore consist primarily of a group of formal definitions and 
several exercises. 


If and only if P(f(s) ≥ g(s)) = 1, f dominates g. If and only if some
(every) element of F dominates (is dominated by) g, F dominates (is
dominated by) g. If and only if F dominates every element of G,
F dominates G. If and only if f dominates g, but g does not dominate
f, f strictly dominates g. If and only if f ∈ F, and f is not strictly
dominated by any element of F, f is admissible (with respect to F).


Involving as they do acts as well as sets of acts, the definitions, 
strictly speaking, introduce four different kinds of dominance. How- 
ever, this complexity can be alleviated, with a slight lapse of logic, by 
identifying each act f with the set of acts of which f is the only element, 
for it is easily seen that this identification is in such harmony with the 
definition that, once it is made, the four kinds of dominance collapse 
into one. 
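
For a finite set of states the definitions can be applied mechanically, reading the defining condition as f(s) ≥ g(s) on a set of probability one. The Python sketch below is an illustration under two assumptions that should be stressed: the three qualities of the grapes are treated as though they were the states, and a fifth act, labeled hypothetical, is invented so that something inadmissible is present.

```python
from fractions import Fraction as Fr

# Assumption: the qualities Q = 1, 2, 3 stand in for the states s.
PROB = {1: Fr(1, 4), 2: Fr(1, 2), 3: Fr(1, 4)}

# Each act is given by its utility in each state (rows of Table 2).
ACTS = {
    "buy 0": {1: 0, 2: 0, 3: 0},
    "buy 1": {1: -1, 2: 1, 3: 3},
    "buy 2": {1: -3, 2: 0, 3: 5},
    "buy 3": {1: -6, 2: -2, 3: 6},
    "hypothetical": {1: -2, 2: 0, 3: 3},  # invented for this sketch
}


def dominates(f, g, prob):
    """f dominates g if f(s) >= g(s) with probability one."""
    return all(f[s] >= g[s] for s, p in prob.items() if p > 0)


def strictly_dominates(f, g, prob):
    """f dominates g, but g does not dominate f."""
    return dominates(f, g, prob) and not dominates(g, f, prob)


def admissible(acts, prob):
    """Names of the acts not strictly dominated by any act in the set."""
    return [name for name, f in acts.items()
            if not any(strictly_dominates(h, f, prob) for h in acts.values())]


print(admissible(ACTS, PROB))
```

No probabilities enter beyond the question of which states have probability zero, which is why two people who agree on that question agree about dominance.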


Exercises 


1a. Consider analogues of Exercises 3.2b and 3.4.

1b. When can two acts dominate each other?

2a. If F extends G, then F dominates G. Discuss the converse. 

2b. F(x) dominates F. 

2c. If F ⊃ G, then F dominates G.

3a. If F ⊂ G, and F dominates G, then each admissible element of G
dominates and is dominated by an element of F. 

3b. After any finite number of non-admissible elements is deleted 
from F, what remains of any subset of F that dominated F continues to 
dominate F. 

3c. Though the set of admissible elements of F may in some instances 
dominate F, no proper subset of the set of admissible elements can ever 
do so; but, if any other subset dominates F, some proper subset of it 
also does so. 

3d. If F is finite, the set of admissible elements of F dominates F. 

3e. Discuss the role of ‘finite’ in 3b and 3d.

4a. If the set of admissible elements of F dominates G, and G dominates
F, then the set of admissible elements of F is equivalent to the
set of admissible elements of G. 

4b. If F and G dominate each other, and either is finite, then the 
sets of admissible elements of F and G, respectively, are equivalent to 
each other, and each dominates both F and G. 

5. If F dominates G, then v(F) ≥ v(G).

6. If F dominates G, then, for any observation x, F(x) dominates 
G(x). 


5 Outline of the design of experiments


Often, especially in statistics, a decision problem can be seen as the 
problem of deciding which of several experiments—or which of several 
observational programs, if that is really a more general term—to under- 
take. 

In this section the notion of the decision problem derived from a 
basic decision problem and an observation must be elaborated a little, 
because, as derived acts have been treated thus far, they correspond to 
the possibility of making an observation free of charge. Though obser- 
vations are sometimes free, there is typically a cost associated with 
making them; information must typically be bought either from other 
people or, more often, from nature, so to speak. The cost of information
may be money, trouble, one’s own life, that of another, or any of
innumerable possibilities, but all can in principle be measured in terms 
of utility. The cost of an observation in utility may be negative as 
well as zero or positive; witness the cook that tastes the broth. 

In principle, if a number of experiments are available to a person, he 
has but to choose one whose set of derived acts has the greatest value 
to him, due account being taken of the cost of observation. That simple 
formulation, like some others in this book, is, in a sense, oversimple; it 
abstracts from the enormous variety of considerations that enter into 
the careful design of any experiment. The possibility of so abstracting 
from variety does not remove the ultimate necessity of studying some 
aspects of that variety in detail. R.A. Fisher’s The Design of Experi- 
ments [F4], for example, is concerned almost exclusively with experiments 
based on a special technique called the analysis of variance, and it is 
but an introduction to even that important facet of statistics. Again, 
there is a growing literature (in which the work of A. Wald is outstand- 
ing) on sequential analysis, which is concerned in principle with all ex- 
periments in which later parts of the experiment are conducted in the 
light of what happens in earlier parts; but this literature has, by necessity,
been confined to a relatively tiny part of that domain.

Before turning to a more formal recapitulation of the outline of the
design of experiments, this may be a good place for a few speculative 
words about the difference, if any, between experiment and observation. 

Some sciences are commonly called experimental as opposed to others 
that are called observational. Aerodynamics, the psychology of rote 
learning, and the genetics of fruit flies would typically be called experi- 
mental sciences; and, to take parallel examples, meteorology, the psy- 
chology of dreams, and human genetics would be called observational. 
But it is widely agreed, and the most casual consideration makes it 
clear, that any basic difference that may really be present resides not 
in the sciences themselves but in the methods typical of each. To illus- 
trate the role of observation in sciences ordinarily considered experi- 
mental and vice versa, observations of wild populations of fruit flies 
have been useful in the study of the genetics of fruit flies; the effects of 
fatigue, for example, on dream content may well be the subject of an 
experiment; and, except for the atom, no topic in science is more popu- 
lar today than experimental rain making. The illustrations could be 
extended indefinitely, and there is also a less direct sort exemplified by 
the discipline called experimental medicine, which typically studies ex- 
periments on animals with the hope, often justified, that the findings 
thus obtained can be extrapolated to humans. 

The problem, then, is to distinguish an experiment from an observa- 
tion. Except for brevity, it might be better to say mere observation, 
for, in general usage, an experiment would be considered a special sort 
of observation. 

The first apparent contrast that comes to mind is that experimenta- 
tion is generally thought of as active and observation as passive. But, 
upon examination, it is seen that observation is also active, for obser- 
vations are typically made by going somewhere to observe, or waiting 
attentively till something happens. Often it is not only the observer 
himself who must be transported and put in readiness to make an ob- 
servation, but also a considerable body of apparatus. What demands 
more activity than the modern observation of a solar eclipse? 

Another apparent contrast is that the experimenter acts on the thing 
he observes, whereas the observer acts only on himself and on instru- 
ments of observation that may be regarded as extensions of his own 
sense organs. If this criterion were accepted altogether naively, there 
would be no such thing as a physiological experiment on one’s self; 
even sophisticated interpretations might find it difficult to embrace 
psychological experiments on one’s self. 

Finally, experiments as opposed to observations are commonly sup- 
posed to be characterized by reproducibility and repeatability. But 
the observation of the angle between two stars is easily repeatable and
with highly reproducible results in double contrast to an experiment to 
determine the effect of exploding an atomic bomb near a battleship. 
All in all, however useful the distinction between observation and ex- 
periment may be in ordinary practice, I do not yet see that it admits 
of any solid analysis. At any rate, no formal use of the distinction will 
be attempted in this book. 

Return now to the notion of observation subject to cost. It may be 
that the value of the random variable x is observable but only at a 
cost c, a real-valued random variable measured in utiles. If, as heretofore,
F(x) denotes the set of acts derived from F on cost-free observation
of x, let F(x) − c denote the set of derived acts subject to the random
cost c. This notation is interpreted to mean that, if f is the generic
element of F(x), then f − c (which, being a utility-valued function of
s, is an act) is the generic act of the set F(x) − c. Very often the cost
of an observation is independent of s, but not, for example, for him that 
tests the sharpness of a thorn with his finger. Since observations are 
typically paid for before, or simultaneously with, making the observa- 
tion, the cost is typically observed along with the observation proper. 
Put differently, the cost c is typically a contraction of the observation 
x. Thus, if in some special context any advantage were to be gained 
by so doing, it would not be drastic to assume the cost of observing x 
to be a function of the form c’(x); but, as a matter of fact, no such ad- 
vantage has come to my attention. It is not difficult to think of ex- 
periments to which the assumption does not apply. For example, in 
the present state of uncertainty about the long-term effects of x-rays, 
anyone conducting a short-term experiment in which young human be- 
ings were subjected to large doses of x-radiation would risk costs that 
might not overtly manifest themselves for half a century, or even for 
generations. 

Much that would ordinarily be called observation cannot be described 
by saying that the random cost is simply to be subtracted from each de- 
rived act of the corresponding observation thought of as free of cost. 
Allowing that it may be legendary, the form of trial by ordeal in which 
the guilty floated safely to be hanged and the innocent drowned to be 
exonerated epitomizes such a situation; except in point of absurdity, 
ordinary industrial destructive testing of electric fuses and other prod- 
ucts is much the same. Strictly speaking, discrepancy occurs even in 
the ordinary context in which the cost of observation is a fixed sum of 
money; for the utility of money is not strictly linear, so the cost of ob- 
servation typically affects different derived acts somewhat differently. 
This sort of situation is indeed so common as to introduce at least a 
slight error into almost every application of the notion of cost as a subtractive
term. It would therefore be desirable to extend considerably
the notion of cost of observation, but, thus far, I see no way to do so 
that does not destroy the mathematical advantage of singling problems 
of observation out of the class of decision problems generally. 

It is convenient now to analyze the appropriateness of regarding the 
number v(F) as a measure of the value of F. As must already be clear 
to the reader, if a person is to make a preliminary decision limiting his 
next decision to one or another of several sets of acts, say, F, G, and H, 
then his preliminary decision will select a set that has the highest value 
of v, and the preliminary and secondary decisions, regarded as a single 
grand decision, amount to the problem of deciding on an act from 
F ∪ G ∪ H. So far as this use of v is concerned, any increasing monotonic
function of v such as v^3 or 3^v would be equally satisfactory, but v
has an advantage in arithmetic simplicity when costs of observation 
are involved. Consider, for example, the problem of whether to make 
a particular observation at the random cost c or to make no observation 
at all. The two sets of acts involved may then be symbolized by 
(F(x) − c) and F, respectively. The peculiar simplicity of v as a measure
of the value of a set of acts, in this context, is exhibited by the almost
obvious fact that v(F(x) − c) = v(F(x)) − E(c). It may be remarked
in passing that v is a particularly good measure in any problem where 
F, G, or H is, so to speak, made available by lot, a possibility realized 
in (7.3.2), for example. 
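
The subtractive rule can be checked numerically on the example of the grapes. In the Python sketch below the cost is assumed, purely for illustration, to be a function c’(x) of the observed value; the particular figures in COST are hypothetical, as are the names used.

```python
from fractions import Fraction as Fr
from itertools import product

# Body of Table 3 of the grapes example: JOINT[x][Q] = 128 * P(x and Q).
JOINT = {
    1: {1: 15, 2: 5, 3: 1},
    2: {1: 10, 2: 15, 3: 2},
    3: {1: 4, 2: 24, 3: 4},
    4: {1: 2, 2: 15, 3: 10},
    5: {1: 1, 2: 5, 3: 15},
}
# Body of Table 2 of the grapes example: UTILITY[f][Q].
UTILITY = {
    0: {1: 0, 2: 0, 3: 0},
    1: {1: -1, 2: 1, 3: 3},
    2: {1: -3, 2: 0, 3: 5},
    3: {1: -6, 2: -2, 3: 6},
}
# Hypothetical cost of observing, in utiles, as a function of x.
COST = {1: Fr(1, 4), 2: Fr(1, 8), 3: Fr(1, 8), 4: Fr(1, 8), 5: Fr(0)}
FREE = {x: Fr(0) for x in JOINT}


def expected(strategy, cost):
    """E(g - c) for the derived act g fixed by the strategy function."""
    return sum(Fr(JOINT[x][q], 128) * (UTILITY[strategy[x]][q] - cost[x])
               for x in JOINT for q in JOINT[x])


strategies = [dict(zip(JOINT, acts))
              for acts in product(UTILITY, repeat=len(JOINT))]
v_free = max(expected(s, FREE) for s in strategies)  # v(F(x))
v_net = max(expected(s, COST) for s in strategies)   # v(F(x) - c)
mean_cost = sum(Fr(sum(JOINT[x].values()), 128) * COST[x]
                for x in JOINT)                      # E(c)

print(v_free, mean_cost, v_net)
```

Because the same random cost is subtracted from every derived act, the maximizing strategy is unchanged and only the value shifts by E(c).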

Finally, if one among several observations is to be chosen, each with 
its own random cost (possibly including the null observation), the person
will choose an observation for which v(F(x)) − E(c) is as large as
possible. If the number of observations among which decision is to 
be made is infinite, that function may not attain a maximum value, 
but the value of the situation to the person can reasonably be regarded 
as the supremum of the function; there are, of course, observations 
among those available for which the supremum is arbitrarily nearly 
attained. 
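
As an illustration of the choice among observations, the Python sketch below compares three observations in the example of the grapes of § 2: the null observation, the tasting described by Table 3, and direct observation of the quality. The expected costs attached to them are hypothetical figures chosen for the sketch, and the names are invented.

```python
from fractions import Fraction as Fr

# Body of Table 2 of the grapes example: UTILITY[f][Q].
UTILITY = {
    0: {1: 0, 2: 0, 3: 0},
    1: {1: -1, 2: 1, 3: 3},
    2: {1: -3, 2: 0, 3: 5},
    3: {1: -6, 2: -2, 3: 6},
}
P_Q = {1: Fr(1, 4), 2: Fr(1, 2), 3: Fr(1, 4)}  # Table 1

# An observation is described by its joint distribution with Q:
# joint[x][Q] = P(x and Q).
COUNTS = {1: (15, 5, 1), 2: (10, 15, 2), 3: (4, 24, 4),
          4: (2, 15, 10), 5: (1, 5, 15)}        # body of Table 3
TASTING = {x: {q: Fr(n, 128) for q, n in zip((1, 2, 3), row)}
           for x, row in COUNTS.items()}
NULL = {"nothing": dict(P_Q)}                   # a single possible outcome
DIRECT = {q: {q: P_Q[q]} for q in P_Q}          # the quality itself is seen


def value(joint):
    """v(F(x)): for each x take the best act, then add up over x."""
    return sum(
        max(sum(p * UTILITY[f][q] for q, p in row.items()) for f in UTILITY)
        for row in joint.values()
    )


# Hypothetical expected costs E(c), in utiles.
OPTIONS = {
    "null observation": (NULL, Fr(0)),
    "tasting": (TASTING, Fr(1, 8)),
    "direct observation": (DIRECT, Fr(5, 4)),
}
net = {name: value(joint) - cost for name, (joint, cost) in OPTIONS.items()}
choice = max(net, key=net.get)

print(net, choice)
```

With these costs the more informative direct observation is not worth its price, and tasting is chosen; other hypothetical costs would of course lead to other choices.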


CHAPTER 7 


Partition Problems 


1 Introduction 


In the introduction of the preceding chapter it was explained that 
the treatment of decision problems in general had been carried to a 
logical conclusion, and that to study decision problems further it had 
become necessary to specialize. The notion of observation was accord- 
ingly chosen as the subject of specialization. The situation now re- 
peats itself at a new level, for I have now covered the main points that 
occur to me about observation in general, though I see considerably 
more to say about a certain type of observation. 

The type of observation problem to which the present chapter is de- 
voted, though relatively special, is still very general. Indeed, its gen- 
erality is suggested by the fact that no other type of problem is syste- 
matically treated in modern statistics. In objectivistic terms, it would 
be described as the type of decision problem in which the consequence 
of each basic act depends only on which of several (possibly infinitely
many) probability distributions does in fact apply to the random vari- 
able to be observed. 

Modern statistics has no name for this type of problem, because it 
recognizes no other type; and no particularly suggestive name occurs 
to me. I am therefore tentatively adopting the noncommittal name
“partition problem.” Such motivation as there is for that name will
be apparent when the concept is defined. 

In non-objectivistic terms, a partition problem has the following 
structure. There are, of course, basic acts F and an observation x.
The peculiar feature is a random variable b, which is typically not sub- 
ject to observation, with the property that every f in F is constant 
given that b has any particular value b. 

In many practical problems b takes on an infinity, even a non-de- 
numerable infinity, of values, but systematic consideration of such 
problems would involve those advanced mathematical techniques that 
are explicitly being avoided in this book. Glossing over such questions 
of technique for the moment, the state of the world, which is itself a 
random variable, might play the role of b; with respect to this b, any
observational decision problem would presumably be a partition prob- 
lem. It may, therefore, be inaccurate to call partition problems special, 
but they are special whenever b is not equivalent to the state of the 
world. 

As has just been mentioned, the general policy of this book with re- 
spect to mathematical technique restricts formal treatment of partition 
problems here to those in which b assumes only a finite number of different
values, that is to say, those in which b is to all intents and purposes
a partition B_i, whence the name “partition problem.” For the
reader who is not familiar with the elements of the geometry of n-dimen- 
sional convex bodies, there will be a distinct expository advantage in 
confining the formal treatment still further to twofold partitions. At 
the same time, by explicit statements and by the use of suggestive no- 
tation, all readers will be given at least some idea of the extension of 
the theory to n-fold partitions; indeed, a reader familiar, for example, 
with Sections 16.1-2 of [V4], or with [B20] will find the extension as
plain as if it had been made explicitly. Thus the restriction to twofold 
as opposed to n-fold partitions will be to the advantage of some and to 
the disadvantage of none. 

Partition problems are even closer than are observational problems 
generally to the subject matter of statistics proper. In particular, in 
the course of this chapter, multipersonal considerations will from time 
to time be pointed out in connection with partition problems. 


2 Structure of (twofold) partition problems 


A central feature of a twofold partition problem is, of course, a twofold
partition, or dichotomy, B_i, i = 1, 2. By way of abbreviation let
β(i) = P(B_i), and β = {β(1), β(2)}. The β(i)’s can be any two numbers
such that β(i) > 0 and Σβ(i) = β(1) + β(2) = 1. Since β(2) = 1 − β(1),
it might seem superfluous to have a special notation for β(2); but
this redundancy more than pays for itself in symmetry, especially in
the extension of the theory to n-fold partitions. The possibility that
one of the β(i)’s vanishes has been ruled out, for it is neither typical nor
interesting, and its retention would mar the exposition of the theory.

Each basic act f ∈ F is characterized by a pair of numbers f_i such that


(1)    P(f(s) = f_i \mid B_i) = 1


for each i. The technical assumption will be made that as f ranges
over F the numbers f_i are bounded from above for each i, which is a
little more stringent than the now familiar assumption that v(F) < ∞.
The assumption expressed by (1) is made for definiteness and simplicity,
though its full force will seldom be used. The possibility of relaxing (1)
in certain contexts will be mentioned from time to time, especially
since this possibility is of some interest even in the exploitation
of (1) itself. In particular, for several pages now it will scarcely ever
be necessary to assume anything about the structure of F relative to
B_i, except that E(f | B_i) is bounded from above for each i; for making
the abbreviation f_i = E(f | B_i), almost everything from here through
Exercise 1 applies verbatim.

The expected utility of any f ∈ F can be computed in several forms
thus:


(2)    E(f) = E(f \mid B_1) P(B_1) + E(f \mid B_2) P(B_2)
            = f_1 \beta(1) + f_2 \beta(2)
            = \sum_i f_i \beta(i)
            = f_2 + (f_1 - f_2) \beta(1).


The first of these forms expresses the expected value in general terms; 
the second utilizes abbreviations; the third is an obvious mathematical 
transcription of the second, particularly suggestive of extension to the 
n-fold situation; the fourth sacrifices the symmetry exhibited by the 
preceding three in order to take advantage of the relation between 
β(1) and β(2). From the fourth form of (2), it is clear that, for fixed f,
E(f) is a linear function of β(1). Henceforth that fact, for example,
would be expressed in symmetric form by saying that E(f) is linear in
β, and the dependence of E(f) on β might be indicated by
writing E(f | β).

Since in any one decision problem β is constant, it might seem pointless
to emphasize that E(f | β) is linear in β. But there are, in fact, two
different reasons for being interested in variation of β. In the first place,
once the observation x has been observed to have the value x, the basic,
or a priori, decision problem is replaced by an a posteriori problem in
which P(B_i | x) plays the role originally played by P(B_i) = β(i). Second,
interest in comparing different people is becoming increasingly
more explicit as the book proceeds. In particular, it is of interest to
compare people who have available the same set of basic acts and who,
at least so far as the distribution of x and the acts in F are concerned,
have the same conditional personal probability given B_i, but who attach
different probabilities β(i) to the elements of the partition.

To emphasize its dependence on β, v(F) will sometimes be written
v(F | β); its computation in the following fashion is fundamental to
the theory of partition problems.


(3)    v(F \mid \beta) = \sup_{f \in F} E(f \mid \beta)
                       = \sup_{f \in F} [f_1 \beta(1) + f_2 \beta(2)]
                       = k(\beta),


where k(β) is defined by the equation in which it occurs. According to
Exercise 4 of Appendix 2, the function k is convex in β, that is, k is
convex when recognized as a function of β(1) alone. Interpreted as a
pair of a priori probabilities, β is confined to the open interval defined
by Σβ(i) = 1, β(i) > 0, but it is valuable to recognize that k is defined,
convex, and continuous on the closed interval Σβ(i) = 1, β(i) ≥ 0.
Many typical features of the relationship between F and B_i are illustrated
graphically by Figure 1. The abscissa of that graph represents
both β(1) and β(2), as indicated, and the ordinate is measured in utiles.
The straight lines, the left ends of which are marked a, b, c, d, and e,
graph as functions of β the expected values of the five basic acts of the
particular problem represented. The ordinates at their right and left
ends, respectively, are the corresponding values of the f_1’s and f_2’s.
The graph of k is marked by heavy line segments. It is seen that the
lines a, c, and e, and they alone, touch the graph of k, for they represent
the only acts that are optimal for some value of β. The act represented
by d is inadmissible (if (1) is taken literally), being in fact strictly
dominated by every other act except e, and it is therefore superfluous
to the person, no matter what the value of β; b is obviously equally
superfluous, but for a different reason.


Figure 1
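The situation described for Figure 1 can be imitated numerically. The following is only an illustrative sketch and not part of the original text: the five pairs (f_1, f_2) are invented numbers chosen to reproduce the qualitative pattern of the figure, and the helper names (`expected_utility`, `breakpoints`, and so on) are hypothetical. It uses the fact that, for a finite F, an act is optimal for some β exactly when it is optimal at β(1) = 0, at β(1) = 1, or at a crossing of two of the lines.

```python
from fractions import Fraction as Fr
from itertools import combinations

# Hypothetical acts (f_1, f_2); the numbers are invented, not read off Figure 1.
ACTS = {
    "a": (Fr(1), Fr(6)),
    "b": (Fr(2), Fr(5)),
    "c": (Fr(4), Fr(4)),
    "d": (Fr(0), Fr(1)),
    "e": (Fr(6), Fr(0)),
}

def expected_utility(f, b1):
    """E(f | beta) in the fourth form of (2): f_2 + (f_1 - f_2) * beta(1)."""
    f1, f2 = f
    return f2 + (f1 - f2) * b1

def k(b1):
    """Upper envelope k(beta) of (3) for the finite set ACTS."""
    return max(expected_utility(f, b1) for f in ACTS.values())

def breakpoints():
    """beta(1) = 0, beta(1) = 1, and every crossing of two lines inside [0, 1]."""
    pts = {Fr(0), Fr(1)}
    for (f1, f2), (g1, g2) in combinations(ACTS.values(), 2):
        slope_gap = (f1 - f2) - (g1 - g2)
        if slope_gap != 0:
            t = (g2 - f2) / slope_gap
            if 0 <= t <= 1:
                pts.add(t)
    return pts

def optimal_somewhere():
    """Names of acts whose line touches the graph of k for at least one beta."""
    pts = breakpoints()
    return sorted(name for name, f in ACTS.items()
                  if any(expected_utility(f, t) == k(t) for t in pts))

def strict_dominators(name):
    """Acts strictly better than `name` under both B_1 and B_2."""
    f1, f2 = ACTS[name]
    return sorted(n for n, (g1, g2) in ACTS.items() if g1 > f1 and g2 > f2)

print(optimal_somewhere())     # acts touching the graph of k
print(strict_dominators("d"))  # acts that strictly dominate d
print(strict_dominators("b"))  # b is undominated, yet never optimal
```

With these invented numbers, b is superfluous because it lies below the envelope everywhere, while d is superfluous because it is strictly dominated, which is the distinction drawn in the text.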

In many typical problems in which F has an infinity of elements, k 
is, unlike the k in Figure 1, strictly convex; that is, its only intervals 
of linearity are point intervals. 


Exercise 


1. Compute and graph k for the set F of dichotomous acts of the
form


        f_1(\varphi) = 1 - (1 + \varphi)^2;
        f_2(\varphi) = 1 - (1 - \varphi)^2;        -2 \le \varphi \le +2.

Answer. k(\beta) = [\beta(1) - \beta(2)]^2 = [2\beta(1) - 1]^2.
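The stated answer can be checked numerically. The sketch below is an illustration only, under the assumption that the parameter of the acts ranges over the interval from −2 to +2; the function names and the grid of 401 parameter values are hypothetical choices. Exact rational arithmetic is used so that the comparison with the closed form is not blurred by rounding.

```python
from fractions import Fraction as Fr

def f1(phi):
    """Consequence of the act with parameter phi when B_1 obtains."""
    return 1 - (1 + phi) ** 2

def f2(phi):
    """Consequence of the act with parameter phi when B_2 obtains."""
    return 1 - (1 - phi) ** 2

def k_grid(b1, steps=400):
    """Maximum of E(f | beta) over a grid of phi in [-2, 2] with spacing 4/steps."""
    b2 = 1 - b1
    phis = [Fr(-2) + Fr(4 * j, steps) for j in range(steps + 1)]
    return max(f1(p) * b1 + f2(p) * b2 for p in phis)

def k_closed(b1):
    """The closed form given as the answer: [2 beta(1) - 1]^2."""
    return (2 * b1 - 1) ** 2

for b1 in (Fr(1, 10), Fr(1, 4), Fr(1, 2), Fr(7, 10), Fr(9, 10)):
    print(b1, k_grid(b1), k_closed(b1))
```

For these values of β(1) the maximizing parameter, β(2) − β(1), happens to lie on the grid, so the grid maximum agrees exactly with the closed form; for other values the grid maximum can only fall slightly short of it.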


Turn now to the relations between an observation x and the dichotomy
B_i. As before, it will be assumed for mathematical simplicity that the
values of x are confined to a finite set X. The probability that x attains
the value x given B_i, written P(x | B_i), is fundamental in connection
with partition problems. For one thing, as has already been indicated,
there is interest in considering people who, though differing with
respect to β, agree with respect to P(x | B_i). The probability P(x, B_i)
that x attains the value x and that B_i simultaneously obtains, the probability
P(x) that x attains the value x, and the probability β(i | x) of B_i
given that x(s) = x are derived from P(x | B_i) and β by means of Bayes’
rule (3.5.4) and the partition rule (3.5.3) thus:


(4)    P(x, B_i) = P(x \mid B_i) \beta(i).
(5)    P(x) = \sum_i P(x, B_i).
(6)    \beta(i \mid x) = P(x, B_i) / P(x),


if P(x) ≠ 0; and β(i | x) is meaningless otherwise. It must be remembered
that P(x, B_i), P(x), and β(i | x) depend on the value of β and that
a really complete notation would show that dependence. On the other
hand, the condition that P(x) ≠ 0 is independent of the value of β.
When a second observation y is to be discussed, β(i | y) is, in defiance
of strict logic, to be understood as the analogue of β(i | x); that is, as
the conditional probability of B_i given that y(s) = y, not as the same
function as β(i | x) with y substituted for x. Corresponding conventions
apply to P(y), P(y | B_i), and P(y, B_i). Finally, free use will be
made of such contractions as β(x) for {β(1 | x), β(2 | x)}.
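Equations (4) through (6) translate directly into a short computation. The sketch below is illustrative only: the prior and the table of P(x | B_i) are invented numbers, and the function names are hypothetical.

```python
from fractions import Fraction as Fr

def joint(prior, likelihood, x):
    """Equation (4): P(x, B_i) = P(x | B_i) * beta(i) for each i."""
    return [likelihood[i][x] * prior[i] for i in range(len(prior))]

def marginal(prior, likelihood, x):
    """Equation (5): P(x) = sum over i of P(x, B_i)."""
    return sum(joint(prior, likelihood, x))

def posterior(prior, likelihood, x):
    """Equation (6): beta(i | x); None where P(x) = 0, since it is undefined there."""
    px = marginal(prior, likelihood, x)
    if px == 0:
        return None
    return [p / px for p in joint(prior, likelihood, x)]

# Invented example: beta = {1/3, 2/3}; likelihood[i][x] stands for P(x | B_{i+1}).
PRIOR = [Fr(1, 3), Fr(2, 3)]
LIKELIHOOD = [
    {0: Fr(1, 2), 1: Fr(1, 2), 2: Fr(0)},
    {0: Fr(1, 4), 1: Fr(3, 4), 2: Fr(0)},
]

for x in (0, 1, 2):
    print(x, marginal(PRIOR, LIKELIHOOD, x), posterior(PRIOR, LIKELIHOOD, x))
```

The value x = 2 has probability zero under both elements of the partition, so its posterior is reported as undefined whatever the prior, in line with the remark that the condition P(x) ≠ 0 does not depend on β.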

Equation (1) implies that

(7)    E(f \mid B_i, x) = E(f \mid B_i)


for all f ∈ F and for all x such that P(x | B_i) > 0. Equation (7) is the
mathematical essence of the concept of a partition problem, and vir- 
tually all that is to be said about partition problems applies verbatim, 
if (7), even without (1), applies to such observations as may be under 
discussion. 

In view of (7),


(8)    E(f \mid \beta, x) = \sum_i E(f \mid B_i, x) P(B_i \mid x)
                          = \sum_i f_i \beta(i \mid x)
                          = E(f \mid \beta(x)),


if P(x) > 0.


3 The value of observation 


If the observation x is made, and it is found that x(s) = x, then the
a posteriori value of the set of basic acts, written v(F | x), or more fully
v(F | β, x), will typically be different from the a priori value v(F | β).
Indeed, in view of (2.8),


(1)    v(F \mid \beta, x) = \sup_{f \in F} E(f \mid \beta, x)
                          = v(F \mid \beta(x))
                          = k(\beta(x)).


This is the first illustration of the technical convenience of the function k. 

It is known on general principles that v(F(x)) ≥ v(F), but there is
some interest in reverifying the inequality in the present context; in 
particular, it is possible here to say in interesting terms just when equal- 
ity can obtain. 


(2)    v(F(x) \mid \beta) = E(v(F \mid \beta(x)) \mid \beta)
                          = E(k(\beta(x)) \mid \beta)
                          \ge k(E(\beta(x) \mid \beta)),


where the terminal inequality is an application of Theorem 1 of Appendix 2.
To appreciate the inequality (2), it is necessary to calculate
E(β(i | x)) explicitly. This calculation, typical of many the reader must
henceforth be expected to make for himself, runs as follows, where it is
to be understood that the summation with respect to x applies only
to those terms for which P(x) is different from 0.


(3)    E(\beta(i \mid x) \mid \beta) = \sum_x \beta(i \mid x) P(x)
                                     = \sum_x \frac{P(x, B_i)}{P(x)} P(x)
                                     = \sum_x P(x, B_i)
                                     = P(B_i) = \beta(i).

Substituting (3) into (2) leads to the anticipated conclusion that

(4)    v(F(x) \mid \beta) \ge k(\beta) = v(F \mid \beta).
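A small numerical instance of (2), (3), and (4) may be helpful. It is a sketch under invented assumptions, not an example from the original: three hypothetical acts, a hypothetical prior, and a hypothetical two-valued observation. The function names are likewise illustrative.

```python
from fractions import Fraction as Fr

# Invented data: acts as pairs (f_1, f_2), a prior beta, and P(x | B_i).
ACTS = [(Fr(1), Fr(6)), (Fr(4), Fr(4)), (Fr(6), Fr(0))]
PRIOR = (Fr(1, 3), Fr(2, 3))
LIKELIHOOD = ({0: Fr(1, 2), 1: Fr(1, 2)}, {0: Fr(1, 4), 1: Fr(3, 4)})
VALUES = (0, 1)

def k(beta):
    """k(beta) = max over the finite F of f_1 beta(1) + f_2 beta(2)."""
    return max(f1 * beta[0] + f2 * beta[1] for f1, f2 in ACTS)

def p_x(x):
    """P(x) by the partition rule."""
    return sum(LIKELIHOOD[i][x] * PRIOR[i] for i in range(2))

def posterior(x):
    """beta(x) = {beta(1 | x), beta(2 | x)} by Bayes' rule; needs P(x) > 0."""
    return tuple(LIKELIHOOD[i][x] * PRIOR[i] / p_x(x) for i in range(2))

def value_with_observation():
    """v(F(x) | beta) = E(k(beta(x)) | beta), summing over x with P(x) > 0."""
    return sum(k(posterior(x)) * p_x(x) for x in VALUES if p_x(x) > 0)

def mean_posterior():
    """E(beta(x) | beta), which (3) shows to equal beta."""
    return tuple(sum(posterior(x)[i] * p_x(x) for x in VALUES if p_x(x) > 0)
                 for i in range(2))

print(k(PRIOR), value_with_observation(), mean_posterior())
```

In this instance the posterior falls on two different linear pieces of k, so the inequality (4) is strict, while the mean of the posterior returns the prior exactly.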


According to Theorem 1 of Appendix 2, v(F(x) | β) is definitely greater
than v(F | β) unless β(x) is confined with probability one to some interval
of linearity of k, in which case the observation x may fairly be
called irrelevant to the basic decision problem at hand. If x is irrelevant,
the interval of linearity to which β(x) is confined must, in view of
(3), contain β. In the particularly interesting case (and the only possible
one, if k(β) is strictly convex) in which β(x) is with probability
one equal to a constant value, that value must therefore be β. An observation
for which β(x) is with probability one equal to β may fairly
be called utterly irrelevant, because it is irrelevant no matter what set
F of basic acts is associated with the dichotomy.

To say that x is utterly irrelevant is to say that, with probability
one,


(5)    \beta(i \mid x) = \frac{P(x \mid B_i) \beta(i)}{P(x)}
                       = \beta(i).

Since β(i) > 0, (5) is equivalent to the condition that

(6)    P(x \mid B_i) = P(x),


at least when P(x) > 0. Furthermore, it is obvious from (2.5), again
noting that β(i) > 0, that, if P(x) = 0, then P(x | B_i) = 0. Therefore
x is utterly irrelevant, if and only if (6) holds for all x and i; that is, if
and only if the distribution of x given B_i is independent of i. This form
of the condition is intuitively evoked by the words “utterly irrelevant”
and has the advantage of not involving β.

It is noteworthy that whether an observation is utterly irrelevant
depends neither on the particular set of basic acts, nor on the value of
β, so people will agree on what is utterly irrelevant independent of their
personal a priori probabilities and the acts among which they are free
to choose.
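The criterion for utter irrelevance can be checked mechanically. The sketch below is illustrative only: the two likelihood tables and the priors are invented, and the function names are hypothetical. It tests condition (6) directly and then confirms, for several priors, that the posterior coincides with the prior when the condition holds.

```python
from fractions import Fraction as Fr

def is_utterly_irrelevant(likelihood):
    """Condition (6): the distribution of x given B_i is the same for every i."""
    first = likelihood[0]
    return all(row == first for row in likelihood)

def posterior(prior, likelihood, x):
    """beta(x) by Bayes' rule; assumes P(x) > 0."""
    joint = [likelihood[i][x] * prior[i] for i in range(len(prior))]
    total = sum(joint)
    return [j / total for j in joint]

# Invented tables of P(x | B_i): identical rows, then differing rows.
SAME = [{0: Fr(1, 4), 1: Fr(3, 4)}, {0: Fr(1, 4), 1: Fr(3, 4)}]
DIFF = [{0: Fr(1, 2), 1: Fr(1, 2)}, {0: Fr(1, 4), 1: Fr(3, 4)}]
PRIORS = [[Fr(1, 10), Fr(9, 10)], [Fr(1, 2), Fr(1, 2)], [Fr(4, 5), Fr(1, 5)]]

print(is_utterly_irrelevant(SAME), is_utterly_irrelevant(DIFF))
for prior in PRIORS:
    print(prior, [posterior(prior, SAME, x) for x in (0, 1)])
```

Note that the test of condition (6) never consults a prior or a set of acts, which is the point of the preceding paragraph.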

The greatest lower bound in x of v(F(x) | β), namely v(F | β), and the
circumstances under which this bound is attained having been established,
it is natural to turn to a parallel investigation of the least upper
bound. A foothold for that investigation is found in the remark that
the chord joining the ends of the graph of k never lies below the graph.
Analytically,


(7)    k(\beta) \le \beta(1) k(\{1, 0\}) + \beta(2) k(\{0, 1\}) = l(\beta),


where l(β) is defined by the context. Unless one of the β(i)’s vanishes,
equality holds in (7), if and only if k(β) is a linear function. In view of
(7) and (3),

(8)    v(F(x) \mid \beta) = E(k(\beta(x)) \mid \beta) \le E(l(\beta(x)) \mid \beta) = l(\beta).

The inequality (8) gives an upper bound for v(F(x)). In graphical
terms it says that, for any β, no observation can add more to the value
k(β) of F than the vertical distance at β between the graph of k and
the graph of the chord joining the ends of k.

Equality obtains in (8), if k is linear, in which case the upper and 
lower bounds are equal to each other irrespective of the value of 6 and 
the nature of the observation. If F is dominated by a single f, that is,
if there is a single f optimal given B_i for both values of i, then k is linear.
It can easily be verified that, provided F is finite and (1) actually obtains,
this is indeed the only circumstance under which k is linear, and,
even if these provisions are not satisfied, the possibilities are not much 
more interesting. 

Suppose, then, that k is not linear; equality can hold in (8), if and
only if β(x) is with probability one confined to the ends of the interval, a
condition that does not depend at all on F. By simple considerations,
which have by now been rendered familiar, this condition on x is equivalent
to the condition that


(9)    P(x \mid B_1) P(x \mid B_2) = 0,


for all x. An observation satisfying (9) may fairly be called definitive,
because, if (1) obtains, such an observation removes all uncertainty
about the outcome of each f ∈ F, no matter what β may be.
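The two bounds on the value of an observation, and the role of definitive observations in attaining the upper one, can be seen in a small computation. This is a sketch with invented acts, prior, and likelihood tables; the names `chord`, `is_definitive`, and `value_with_observation` are hypothetical.

```python
from fractions import Fraction as Fr

ACTS = [(Fr(1), Fr(6)), (Fr(4), Fr(4)), (Fr(6), Fr(0))]  # invented (f_1, f_2)

def k(beta):
    """k(beta) for the finite set ACTS."""
    return max(f1 * beta[0] + f2 * beta[1] for f1, f2 in ACTS)

def chord(beta):
    """l(beta): the chord joining the ends of the graph of k."""
    return beta[0] * k((Fr(1), Fr(0))) + beta[1] * k((Fr(0), Fr(1)))

def is_definitive(likelihood):
    """Condition (9): P(x | B_1) P(x | B_2) = 0 for every x."""
    return all(likelihood[0][x] * likelihood[1][x] == 0 for x in likelihood[0])

def value_with_observation(prior, likelihood):
    """v(F(x) | beta) = sum over x with P(x) > 0 of k(beta(x)) P(x)."""
    total = Fr(0)
    for x in likelihood[0]:
        joint = [likelihood[i][x] * prior[i] for i in range(2)]
        px = sum(joint)
        if px > 0:
            total += k(tuple(j / px for j in joint)) * px
    return total

PRIOR = (Fr(1, 3), Fr(2, 3))
DEFINITIVE = ({0: Fr(1), 1: Fr(0)}, {0: Fr(0), 1: Fr(1)})
NOISY = ({0: Fr(1, 2), 1: Fr(1, 2)}, {0: Fr(1, 4), 1: Fr(3, 4)})

for name, lik in (("definitive", DEFINITIVE), ("noisy", NOISY)):
    print(name, is_definitive(lik), k(PRIOR),
          value_with_observation(PRIOR, lik), chord(PRIOR))
```

With these numbers the definitive observation attains the chord exactly, while the noisy one falls strictly between k(β) and l(β).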

Perhaps many of the observations made in everyday life are defini- 
tive, or practically so. Once Old Mother Hubbard looked in the cup- 
board, her doubts were reduced to the vanishing point. None the less, 
definitive observations do not play an important part in statistical 
theory, precisely because statistics is mainly concerned with uncer- 
tainty, and there is no uncertainty once an observation definitive for 
the context at hand has been made.


4 Extension of observations, and sufficient statistics


It was shown in § 6.4 that a statistic, or contraction, y of an obser- 
vation x is never worth more than x and is typically worth less. The 
purpose of the present section is to explore the relation between an ob- 
servation and a contraction of itself in the case of a partition problem, 
especially to explore the special conditions in that case under which the 
statistic is as valuable as the observation itself. 

Let x and y be two observations such that y is a statistic of x, that 
is, such that, for some function y’, y(s) = y’(x(s)) with probability one. 
The values of F(x) and F(y) can be compared by the following calcula- 
tion, which in the light of the preceding section will need but little ex- 
planation. 


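(1)    v(F(x)) = E(k(\beta(x)) \mid \beta)
               = \sum_y E(k(\beta(x)) \mid \beta, y) P(y).

(2)    E(k(\beta(x)) \mid \beta, y) \ge k(E(\beta(x) \mid \beta, y)),

if P(y) > 0.

(3)    E(\beta(i \mid x) \mid \beta, y) = \sum_x \beta(i \mid x) P(x \mid y)
                                        = \sum_x \frac{P(x, B_i)}{P(x)} \frac{P(x, y)}{P(y)},

if P(y) > 0.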

Because of the special relationship between x and y, P(x, y) = 0 unless
y’(x) = y, in which case P(x, y) = P(x). Understanding that the
summation indicated by Σ’ in (4) below extends only over those values
of x for which y’(x) = y, the calculation is continued thus:


(4)    E(\beta(i \mid x) \mid \beta, y) = \sum' \frac{P(x, B_i)}{P(x)} \frac{P(x)}{P(y)}
                                        = \sum' \frac{P(x, B_i)}{P(y)}
                                        = \frac{P(y, B_i)}{P(y)}
                                        = \beta(i \mid y).

Therefore,


(5)    v(F(x) \mid \beta) \ge \sum_y k(\beta(y)) P(y) = v(F(y) \mid \beta).

After the preceding section, it seems almost superfluous to explain
that the point of the calculation above is not to obtain the inequality
(5), which has already been derived with less labor and greater generality
in Exercises 6.3.8 and 6.3.13b, but to be able to discuss when equality
holds in (5). The calculation makes it clear that equality holds in
(5), if and only if equality holds in (2) for every y of positive probability.
This in turn is equivalent to the condition that, given y, β(x) is confined
with probability one to an interval of linearity of k. A sufficient condition
for that is that, given y, β(x) be confined with probability one to
a single value, which cannot be other than β(y); if k is strictly convex,
the almost certain confinement of β(x) to β(y) is also necessary. Now,
if, for every y of positive probability, P(β(x(s)) = β(y) | y) = 1, then
it is true that β(x) = β(y) with unconditional probability one, that is,


(6)    P(\beta(x(s)) = \beta(y(s))) = 1.


The condition (6) clearly does not depend on F, and the following
calculation so expresses it as to make clear that it does not depend on β
either. Equation (6) is satisfied, if and only if


(7)    \frac{P(x \mid B_i) \beta(i)}{P(x)} = \frac{P(y'(x) \mid B_i) \beta(i)}{P(y'(x))}

when P(x) > 0; or, if and only if

(8)    \frac{P(x \mid B_i)}{P(y \mid B_i)} = \frac{P(x)}{P(y)}

when P(x | B_i) > 0; or, again, if and only if

(9)    P(x \mid B_i, y) = P(x \mid y),


when P(y | B_i) > 0; or finally if and only if P(x | B_i, y) is independent
of i for those values of i for which it is defined. In this form, and yet
another to be derived in connection with (10), the condition is widely
studied in modern statistical theory and a statistic satisfying the condition
is there called a sufficient statistic. The name is well justified;
for, as has just been shown, it is sufficient, for any purpose to which x
might be put, to know y, if and only if y is a sufficient statistic for x.

A different, and perhaps more congenial, approach to sufficient statistics
is the following. If the person observes the particular value y
of y, his original basic decision problem is replaced by a new one with
the same basic acts, but with β replaced by β(y). Strictly speaking,
this will fail to be a partition problem, in case β(y) is (0, 1) or (1, 0), or,
for brevity, if β(y) is extreme. To see whether v(F(x) | β) is really greater
than v(F(y) | β), it is enough to investigate whether, for some y of positive
probability for which β(y) is not extreme, x is relevant to the partition
problem based on β(y), for if β(y) is extreme there can be no value
in following the observation that y has occurred by the observation of
x. Therefore, x will be a worthless addition to y, if, for every y for
which β(y) is not extreme, x is utterly irrelevant, that is, if y is sufficient
for x. If k is strictly convex, the condition is also necessary.
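The defining property, that P(x | B_i, y) does not depend on i, can be tested by brute force for a small finite example. The sketch below is an illustration under stated assumptions and not the author's procedure: it takes three independent two-valued components with invented success probabilities 1/3 and 2/3 under B_1 and B_2, and it assumes every value of x has positive probability under every B_i, so that all the conditional probabilities are defined. The function names are hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product

def bernoulli_likelihood(p, n):
    """P(x | B_i) for n independent components, each equal to 1 with probability p."""
    return {x: p ** sum(x) * (1 - p) ** (n - sum(x))
            for x in product((0, 1), repeat=n)}

def is_sufficient(stat, likelihoods):
    """True if P(x | B_i, y) is the same for every i, where y = stat(x).

    Assumes all likelihoods are positive on the same set of x values.
    """
    cond_tables = []
    for lik in likelihoods:
        p_y = {}
        for x, px in lik.items():
            p_y[stat(x)] = p_y.get(stat(x), 0) + px   # P(y | B_i)
        cond_tables.append({x: px / p_y[stat(x)] for x, px in lik.items()})
    return all(table == cond_tables[0] for table in cond_tables)

LIKS = [bernoulli_likelihood(Fr(1, 3), 3), bernoulli_likelihood(Fr(2, 3), 3)]

print(is_sufficient(sum, LIKS))               # y'(x) = number of ones
print(is_sufficient(lambda x: x[0], LIKS))    # y'(x) = first component only
```

The count of ones passes the test, whereas the first component alone does not, since the remaining components still carry information about the partition.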

The recognition of sufficient statistics in explicit problems is often 
facilitated by the following factorability criterion. A statistic y is suffi- 
cient for x if and only if there exists at least one pair of functions R and 
S such that 


(10)    P(x \mid B_i) = R(y'(x), i) S(x).


The necessity of the condition follows from the exhibition of a particular
R and S for a sufficient statistic thus:


(11)    P(x \mid B_i) = \sum_y P(x \mid B_i, y) P(y \mid B_i)
                      = \sum_y P(x \mid y) P(y \mid B_i)
                      = P(y'(x) \mid B_i) P(x \mid y'(x)).

On the other hand, if P(x | B_i) can be expressed in the form (10), y
can be seen to be sufficient for x thus: If P(x | B_i, y) is meaningful, it
is given by

(12)    P(x \mid B_i, y) = \frac{P(x, y \mid B_i)}{P(y \mid B_i)}
                         = 0, if y'(x) \ne y,
                         = \frac{S(x)}{\sum_{x' : y'(x') = y} S(x')}, if y'(x) = y,

which is independent of i. The reader may be interested in asking
himself, as an exercise, what freedom there is in choosing R and S when 
at least one such pair of factors exists. 
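A concrete factorization may make the criterion (10) more tangible. The sketch below borrows the family of distributions that appears in Exercise 3 of this section, with y’(x) equal to the largest component; the particular factors R and S written here are one possible choice, offered as an assumption to be checked rather than as the only one, and the function names are hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product
from math import prod

def p_single(xr, i):
    """P(x_r | B_i) = 2 x_r / (i (i + 1)) for 1 <= x_r <= i, and 0 otherwise."""
    return Fr(2 * xr, i * (i + 1)) if 1 <= xr <= i else Fr(0)

def p_joint(x, i):
    """P(x | B_i) for independent components."""
    return prod((p_single(xr, i) for xr in x), start=Fr(1))

def R(y, i, n):
    """Candidate factor depending on x only through y = max_r x_r."""
    return Fr(2, i * (i + 1)) ** n if y <= i else Fr(0)

def S(x):
    """Candidate factor not depending on i."""
    return prod(x)

N = 2
XS = list(product(range(1, 6), repeat=N))   # components confined to 1..5 here
ok = all(p_joint(x, i) == R(max(x), i, N) * S(x) for i in range(1, 5) for x in XS)
print(ok)
```

Here S is not identically 1, which is the feature the exercise is meant to exhibit.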

Interest in sufficient statistics is not confined, of course, to twofold, 
or even finite, partitions. With that in mind, the various criteria for 
sufficient statistics have been given in such terms as to be valid for any 
finite partition and the usual infinite ones. They require some modification
if the observations are not confined to a finite, or at any rate denumerable,
set of values, but formal details of that important extension
will not be given here. Elementary treatments are given in most text- 
books of mathematical statistics; more advanced and general treat- 
ments are given in [B2], [L6], and [H3]. 

There are several examples of sufficient statistics in the exercises 
below, others are given in almost any fairly advanced textbook on sta- 
tistics (in particular, in [C9]), and one other general example of extraor- 
dinary importance is treated in the next section. 


Exercises 


In these exercises, let x denote a multiple observation x = {x_1, ..., x_n},
where, given B_i, the x_r’s are independent and identically distributed.
There will be no real advantage here in thinking of the partition as 
twofold, or even finite, and for some of the exercises it will be imprac- 
tical to do so. 


1. Let

        P(x_r \mid B_i) = p_i, if x_r = 1,
                        = q_i, if x_r = 0,
                        = 0, otherwise,

where p_i + q_i = 1; and let y’(x) = Σ_r x_r.

Show that:
(a) P(x \mid B_i) = p_i^{y} q_i^{n - y};
(b) y is sufficient for x, using the factorability criterion;
(c) P(y \mid B_i) = \binom{n}{y} p_i^{y} q_i^{n - y}, where, as always, \binom{n}{y} = n!/y!(n - y)!;
(d) P(x \mid y'(x)) = \binom{n}{y'(x)}^{-1}.


2. For each positive integer i, let


        P(x_r \mid B_i) = i^{-1}, if x_r \le i,
                        = 0, otherwise,


where the values of x_r are confined to the positive integers; and let
y’(x) = max_r x_r. Show that:
(a) P(x \mid B_i) = i^{-n}, if y \le i,
                  = 0, otherwise;
(b) y is sufficient for x.

3. In the two exercises above it has been possible to choose the factor
S identically equal to 1. To exhibit a more typical example, let i,
x_r, and y be confined to the positive integers with y’(x) = max_r x_r, as
in the preceding exercise, and let


        P(x_r \mid B_i) = \frac{2 x_r}{i(i + 1)}, if x_r \le i,
                        = 0, otherwise.

Show that:
(a) P(x \mid B_i) = \left( \frac{2}{i(i + 1)} \right)^n \prod_r x_r, if y \le i,
                  = 0, otherwise.
(b) y is sufficient for x.

4. Put no restriction on the conditional distributions P(x_r | B_i), except
that x_r be confined with probability one to some fixed finite set.
Say, for the moment, that two values x and x’ of x are team mates, if
one arises from the other by permutation of the component observations.
This divides the possible values of x into teams, and, academic
though it may seem, the team to which x belongs can be taken as y’(x).
Show that the probability of x given y’(x) and B_i is independent of i
(if it is defined at all), so that the statistic y’(x) is sufficient for x.

If the values of the x_r’s happen to be real numbers, then for any x
it is possible to permute the component observations to obtain a non-decreasing
sequence of n (not necessarily distinct) numbers, and only
one such non-decreasing sequence can be so obtained from each x.
The sequence thus attached through x to each s is called in statistical 
usage the sequence of order statistics corresponding to x. Since team 
mates, and only team mates, have the same order statistics, the set of 
order statistics regarded as a single statistic is equivalent to the team 
statistic y’(x) defined more generally in the paragraph above and is 
therefore sufficient. 

5. Let x_r given B_i be subject to the normal probability density with
mean μ_i and variance σ_i², that is,


(13)    \varphi(x_r \mid B_i) = (2\pi\sigma_i^2)^{-1/2} \exp\{ -(x_r - \mu_i)^2 / 2\sigma_i^2 \}.


This situation, though elementary, does not fall within the technical
scope of this book, because x_r is not confined to a finite set of values.
The reader familiar with probability densities will see, however, that
the density of x is


(14)    \varphi(x_1, \ldots, x_n \mid B_i) = (2\pi\sigma_i^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma_i^2} \sum_r x_r^2 + \frac{\mu_i}{\sigma_i^2} \sum_r x_r - \frac{n\mu_i^2}{2\sigma_i^2} \right\},


which suggests that y, defined by

(15)    y'(x) = \left\{ \sum_r x_r^2, \sum_r x_r \right\},


may fairly be called a sufficient statistic for x.

Show in the same heuristic way that, if σ_i is independent of i, then
y’(x) = Σx_r defines a sufficient statistic; and that, if μ_i is independent
of i, with common value μ, then y’(x) = Σ(x_r − μ)² does so.

6. If w and z are observations independent of each other given B_i,
under what conditions can w be sufficient for {w, z}?

7. To break away from independent observations, suppose that, in
the event B_i, n cards are dealt from a thoroughly shuffled deck of n + i
cards each bearing a different serial number from 1 through n + i.
Let w_r be the number on the rth card dealt and w = {w_1, ..., w_n}.
Show that max_r w_r defines a sufficient statistic for w and that the w_r’s
are not independent.
8. If z extends w, and w is sufficient for y, then z is also sufficient for 


9. If z is sufficient for w, and y is independent of both z and w, then
{z, y} is sufficient for {w, y}. 
10. Every definitive statistic is sufficient. 


In virtually all statistics texts it would be said that the y defined by 
(15) constitutes not one statistic, but two; similarly, the set of order 
statistics would ordinarily be referred to as n statistics rather than as 
one. There are contexts in which it is appropriate to try to count sta- 
tistics in that fashion, but, so far as the theory of sufficient statistics 
is concerned, it often seems fruitless, if not positively detrimental, to 
do so. 

The concept of sufficient statistics has proved of great value in sta- 
tistical theory and practice. The reason for this does not seem to me 
altogether easy to analyze, but, as the exercises above illustrate, the 
families of distributions most frequently studied in statistics are gen- 
erally rich in sufficient statistics. It is hard to separate cause from 
effect here; for the distributions that are most studied tend to be those 
having the greatest mathematical simplicity, and the presence of striking
sufficient statistics, such as those exhibited by Exercises 1, 2, 3, 5,
and 7, is among the sources of mathematical simplicity most often
met in the study of particular families of distributions.

It must be emphasized that sufficient statistics often provide a signifi- 
cant saving in the mechanical labor of storing and presenting data. 
Thus, in any experiment faithfully represented by Exercise 1, it is
sufficient, in both the technical and ordinary senses of the word, to
record a single integer y in place of the list of x_r’s, which might well be
very long. Several of the other exercises would in principle also lead 
to great savings of this sort, but Exercise 5 is the only other that arises 
frequently in practice. 

The concept of sufficient statistics was introduced, together with 
much of the theory associated with it, by R. A. Fisher (cf. index, [F6]). 
The subject has been one of continuing interest and has been explored 
in several directions; key references are [B2], [E1], [L6], [H3], [K15],
and [M5], and (LeCam 1964). 


5 Likelihood ratios 


The random variable β(x) has played so important a role in preceding
sections that the reader will probably not be surprised to find that
β(x) is a sufficient statistic for x, a conclusion that, in the light of the
factorability criterion (4.10), can be seen thus:


(1)    P(x \mid B_i) = \frac{P(B_i \mid x) P(x)}{\beta(i)}
                     = \frac{\beta(i \mid x)}{\beta(i)} P(x).


If a statistic is sufficient, it is sufficient irrespective of the value of β;
moreover, any multiple of it by a non-zero constant is also sufficient.
Therefore, (1) implies that for any numbers a(i), such that a(i) > 0,
the multiple observation r(a) defined by


(2)    r_i(x, a) = \frac{P(x \mid B_i)}{\sum_j a(j) P(x \mid B_j)},
       r(x, a) =_{Df} \{ r_1(x, a), r_2(x, a) \}

is a sufficient statistic for x. Since


(3)    \sum_i a(i) r_i(x, a) = 1


there is some redundancy in retaining both components, but this redundancy
is more than compensated by the advantage of retaining
symmetry, especially when n-fold partitions are contemplated.
Formally, the r(a)’s are an infinite family of sufficient statistics, one
for each a; but to all intents and purposes they represent but one sufficient
statistic, for any r(a) is equivalent to any other, say r(a’), as can
be demonstrated thus:


(4)    r_i(x, a) = \frac{P(x \mid B_i) / \sum_k a'(k) P(x \mid B_k)}{\sum_j a(j) P(x \mid B_j) / \sum_k a'(k) P(x \mid B_k)}
                 = \frac{r_i(x, a')}{\sum_j a(j) r_j(x, a')}.

Having such a multiplicity of forms for what is essentially one important
statistic is rather embarrassing, so there is some incentive to
pick a standard form. Setting each a(j) = 1 recommends itself as convenient
and leads to the particular statistic r = {r_1, r_2}, where


(5)    r_i(x) = \frac{P(x \mid B_i)}{\sum_j P(x \mid B_j)}.


This form is indeed convenient for twofold and, more generally, for n-fold
partitions, but, where infinite partitions are to be dealt with, its
apparent naturalness is misleading, for the sum in the denominator of
(5) is then typically divergent. In the case of twofold partitions, a
convenient form for the statistic is that of a likelihood ratio, in the
sense introduced in § 3.6, for it is easy to see that, infinite numbers
being admitted, P(x | B_1)/P(x | B_2) is equivalent to r. Henceforth, any
statistic equivalent to r will be called a likelihood ratio of x with respect
to the partition B_i, a definition that does not seriously conflict
with ordinary statistical usage of the term.
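The equivalence of the various forms of the likelihood ratio can be illustrated by direct computation. The sketch below is hypothetical throughout: the pair of conditional probabilities and the weights a are invented, and the function name `r` simply mirrors the notation r(x, a) of (2). It checks the normalization (3), the conversion formula (4), and the agreement of the quotient r_1/r_2 with P(x | B_1)/P(x | B_2).

```python
from fractions import Fraction as Fr

def r(lik_x, a):
    """Equation (2): r_i(x, a) = P(x | B_i) / sum_j a(j) P(x | B_j)."""
    denom = sum(aj * pj for aj, pj in zip(a, lik_x))
    return tuple(p / denom for p in lik_x)

LIK_X = (Fr(1, 2), Fr(3, 4))     # invented {P(x | B_1), P(x | B_2)} for one x
A = (Fr(1, 3), Fr(2, 3))         # an arbitrary positive weighting a
ONES = (Fr(1), Fr(1))            # the standard choice a(j) = 1 of (5)

r_a = r(LIK_X, A)
r_std = r(LIK_X, ONES)

# (3): the weighted components sum to one.
print(sum(ai * ri for ai, ri in zip(A, r_a)))

# (4): r(x, a) recovered from r(x, a') with a' = ONES.
scale = sum(ai * ri for ai, ri in zip(A, r_std))
print(tuple(ri / scale for ri in r_std) == r_a)

# The quotient of components is the same in every form.
print(r_a[0] / r_a[1], r_std[0] / r_std[1], LIK_X[0] / LIK_X[1])
```

Whatever positive weights are chosen, only the direction of the pair {P(x | B_1), P(x | B_2)} survives, which is why the forms are equivalent as statistics.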

Figure 1 illustrates a geometric interpretation of likelihood ratios
that is sometimes valuable. The figure can best be described by telling
how to draw it. First draw a pair of cartesian coordinate axes for variables
u_1 and u_2. Next draw the two line segments represented by u_1 +
u_2 = 1 and (u_1/a(1)) + (u_2/a(2)) = 1 with the u_i’s non-negative. The
left ends of these segments are indicated in Figure 1 by a and b, respectively,
the particular value a = {1/3, 2/3} being used for illustration.
Now plot the point {P(x | B_1), P(x | B_2)}. If x has positive
probability (for any, and therefore for all, β), this point will be different
from the origin O, so it will be possible to draw the (dashed) line connecting
the origin with the point {P(x | B_1), P(x | B_2)}. This line (or
ray through the origin, as it is often called) must necessarily pierce
the line segments a and b. The important geometrical fact, which the
reader will have no difficulty in verifying, is that these intersections
occur at the points {r_1(x), r_2(x)} and {r_1(x, a), r_2(x, a)}, respectively.
It is also obvious that the ratio P(x | B_1)/P(x | B_2) is the reciprocal of
the slope of the ray.


Figure 1

Since, to each x that occurs with positive probability, there corre- 
sponds a ray through the origin, the ray can be taken as a statistic; 
according to the geometrical construction of the preceding paragraph, 
this statistic is equivalent to r and is therefore a likelihood ratio of x
with respect to the partition B_i.

The ray connecting the origin with a point {u_1, u_2} can conveniently
be represented by the suggestive notation u_1:u_2, though, of course, different
pairs of numbers can represent the same ray. More explicitly,
if λ is any number different from 0, λu_1:λu_2 represents the same ray
as u_1:u_2. In analytical projective geometry any pair of numbers representing
a ray in this fashion is called a set of homogeneous coordinates
of the ray. The redundancy of the notation u_1:u_2 may be removed by,
for example, characterizing the ray by the reciprocal of its slope u_1/u_2.
Such non-homogeneous coordinatization entails a sacrifice in symmetry
and the necessity of admitting infinity as a meaningful value of the
quotient; both losses are quite troublesome in extension of these geometric
concepts to cartesian space of n dimensions, which is necessary
in connection with n-fold partitions. In homogeneous coordinates the
likelihood ratio can conveniently be represented by any of the equally
good sets of homogeneous coordinates, P(x | B_1):P(x | B_2), r_1(x):r_2(x),
and r_1(x, a):r_2(x, a). Finally, it may be remarked that P(x | B_1)/
P(x | B_2) is a non-homogeneous coordinate. Thus the many equivalent
forms in which the likelihood ratio statistics can be naturally expressed
correspond to the many different notations by which a ray through the
origin can be naturally designated.

The most remarkable fact about the likelihood ratio considered as a 
statistic is that it is necessary, so to speak, as well as sufficient. By that 
I mean that to have the advantages of knowing x it is necessary as 
well as sufficient to know the likelihood ratio. The point can be put 
formally thus: 


THEOREM | If y is sufficient for x, then y is an extension of r. 


PROOF. The theorem is virtually obvious in terms of the factorability
criterion for sufficient statistics, for in the notation of (4.10)

(6)    $r_i(x) = \dfrac{R(y(x), i)}{\sum_j R(y(x), j)}$

with probability one, exhibiting r_i as a function of y. ◆


COROLLARY 1 If z is sufficient for x, and if every y sufficient for x
is an extension of z, then z is equivalent to r. 

By ordinary analytic standards, the likelihood ratio seems to be a 
rather complicated statistic, at least in the case of n-fold partitions, 
where n is at all large; for, to one who takes seriously the idea that a 
multiple statistic should not also be regarded as a single statistic, the 
likelihood ratio seems at first sight to be n, or perhaps (n — 1), statis- 
tics. Yet Theorem 1 and its corollary show that the likelihood ratio is, 
in a fundamental sense, the most compact sufficient statistic that a 
partition problem admits. 

As an explicit example of a likelihood ratio, consider the twofold
partition problem arising from Exercise 4.1 on confining attention to two
different values of p, say p_1 and p_2. The likelihood ratio r is easily
computed thus:

(7)    $P(x \mid B_i) = p_i^{y(x)} (1 - p_i)^{n - y(x)} = (1 - p_i)^n \left(\dfrac{p_i}{1 - p_i}\right)^{y(x)} = q_i^n \left(\dfrac{p_i}{q_i}\right)^{y(x)},$

so

(8)    $r_i(x) = \dfrac{q_i^n (p_i/q_i)^{y(x)}}{\sum_j q_j^n (p_j/q_j)^{y(x)}}.$

Theorem 1 is thereby verified in the present instance; for (8) exhibits
r explicitly as a contraction of y, and y is easily exhibited as a
contraction of r thus:

(9)    $y(x) = \dfrac{\log \dfrac{r_1(x)}{r_2(x)} - n \log \dfrac{q_1}{q_2}}{\log \dfrac{p_1 q_2}{p_2 q_1}}.$

In this example, y is, in view of (8) and (9), equivalent to the likelihood
ratio.
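
The passage from y to r and back can be checked numerically. The following is a minimal sketch in Python, assuming (as the form of (7) suggests) that x consists of n independent trials with success probability p_i under B_i and that y counts the successes. The function names and the particular values of n, p_1, and p_2 are illustrative only and do not come from the text.

```python
import math

def likelihood_ratio(y, n, p):
    """Return r = (r_1, r_2) as in (8) for y successes in n trials.

    p = (p_1, p_2) holds the two candidate success probabilities.
    """
    weights = [(1 - pi) ** n * (pi / (1 - pi)) ** y for pi in p]
    total = sum(weights)
    return [w / total for w in weights]

def count_from_ratio(r, n, p):
    """Recover y from r by solving (8) for y, as in (9)."""
    p1, p2 = p
    q1, q2 = 1 - p1, 1 - p2
    numerator = math.log(r[0] / r[1]) - n * math.log(q1 / q2)
    denominator = math.log(p1 * q2 / (p2 * q1))
    return numerator / denominator

# Illustrative values (assumptions, not from the text).
n, p = 10, (0.3, 0.6)
recovered = [count_from_ratio(likelihood_ratio(y, n, p), n, p)
             for y in range(n + 1)]
```

Each entry of `recovered` agrees with the count it came from up to rounding error, which is the sense in which y and r are contractions of one another in this example.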


Exercises 
1. Express k(β(x)) and v(F(x)) in terms of the likelihood ratio thus:

(10)   $\beta(i; r) =_{\mathrm{Df}} r_i \beta(i) \Big/ \sum_j r_j \beta(j),$
(11)   $k(\beta(x)) = k(\beta(r(x))),$
(12)   $v(F(x) \mid \beta) = \sum_r k(\beta(r)) \sum_j P(r \mid B_j) \beta(j).$


2. This extended exercise develops the personalistic and behavioral- 
istic theory of what, following the objectivistic and verbalistic tradi- 
tions of statistics, is called the testing of a simple dichotomy, a type of 
decision problem that, though seldom very realistic, is a popular and 
instructive example with important implications for more realistic prob- 
lems. Verbalistically such a problem is described as that of making the 
best guess on the basis of an observation as to whether it is B_1 or B_2
that obtains. Behavioralistically, this is generally interpreted as the
problem of deciding, on the basis of observation, between two primary
acts one of which is preferable to the other if B_1 obtains and vice versa
if B_2 does. Here is one topic in which the assumption that i is confined
to two values is rather more than simply a pedagogical simplification; 
a reader interested in relaxing the assumption will find pages 127-130 
of [W3] stimulating. 

Suppose that F contains only two acts f_1 and f_2 and is dominated by
neither. Let φ_ij =_Df E(f_i | B_j).

(a) There is no loss of generality in supposing

(13)   $\delta_1 =_{\mathrm{Df}} \dfrac{\phi_{22} - \phi_{12}}{2} > 0, \qquad \delta_2 =_{\mathrm{Df}} \dfrac{\phi_{11} - \phi_{21}}{2} > 0,$

which will henceforth be done. That is, it will be supposed that f_i is
appropriate only to B_i and vice versa.

(b) Show that

(14)   $\begin{aligned}
k(\beta) &= \sum_j \phi_{1j} \beta(j) \quad \text{for } \beta(1) \ge \delta_1/(\delta_1 + \delta_2) = \beta_0(1) \\
&= \sum_j \phi_{2j} \beta(j) \quad \text{for } \beta(2) \ge \delta_2/(\delta_1 + \delta_2) = \beta_0(2) \\
&= \tfrac{1}{2}(\phi_{11} + \phi_{21})\beta(1) + \tfrac{1}{2}(\phi_{12} + \phi_{22})\beta(2) + |\delta_1 \beta(2) - \delta_2 \beta(1)| \\
&= \sum_j \epsilon_j \beta(j) + |\delta_1 \beta(2) - \delta_2 \beta(1)|,
\end{aligned}$

where β_0 and the ε_j’s are defined by the context.

(c) E(f_i | β) = k(β), if and only if β(i) ≥ β_0(i). This condition
obtains for both i’s simultaneously, if and only if β = β_0.

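(d) Show that

(15)   $\begin{aligned}
k(\beta(r)) &= \sum_j \epsilon_j \beta(j; r) + |\delta_1 r_2 \beta(2) - \delta_2 r_1 \beta(1)| \Big/ \sum_j r_j \beta(j) \\
&= \sum_j \phi_{ij} \beta(j; r) \quad \text{for } r_i \ge r_i^*(\beta, \beta_0),
\end{aligned}$

where

(16)   $r_i^*(\beta, \beta_0) =_{\mathrm{Df}} \dfrac{\beta_0(i)/\beta(i)}{\sum_j \beta_0(j)/\beta(j)},$

and that

(17)   $\begin{aligned}
v(F(x) \mid \beta) &= \sum_j \epsilon_j \beta(j) + \sum_r |\delta_1 P(r \mid B_2)\beta(2) - \delta_2 P(r \mid B_1)\beta(1)| \\
&= \{\epsilon_1 + \delta_2[1 - 2P(r_1 < r_1^*(\beta, \beta_0) \mid B_1) - P(r_1 = r_1^*(\beta, \beta_0) \mid B_1)]\}\beta(1) \\
&\quad + \{\epsilon_2 + \delta_1[1 - 2P(r_2 < r_2^*(\beta, \beta_0) \mid B_2) - P(r_2 = r_2^*(\beta, \beta_0) \mid B_2)]\}\beta(2).
\end{aligned}$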


(e) Any derived act f(x) determines a function i assigning an i to
each x, i being implicitly defined thus: f(x) = f_{i(x)}. Conversely any i
determines a derived act. Show that E(f(x) | β) = v(F(x) | β), if and
only if r_{i(x)}(x) ≥ r_{i(x)}*(β, β_0) for every x. Such a function i(x) is called
a likelihood-ratio test associated with r*. Show that at least one
likelihood-ratio test is associated with every value of r*, and that if P(r = r*)
= 0 (which is typically the case) there is only one.

(f) If f(x) is determined by a function i, the probability of deciding
on the inappropriate value of i in case B_j obtains is generally called
the probability of an error of the j-th kind. Analytically the
probabilities of error of the first and second kind are, respectively,

(18)   $e_1 =_{\mathrm{Df}} P(i(x) = 2 \mid B_1), \qquad e_2 =_{\mathrm{Df}} P(i(x) = 1 \mid B_2).$

If i* is a likelihood-ratio test associated with r*, show that its errors
of the first and second kind are subject to the bounds

(19)   $P(r_1 < r_1^* \mid B_1) \le e_1^* \le P(r_1 \le r_1^* \mid B_1),$
(20)   $P(r_1 > r_1^* \mid B_2) \le e_2^* \le P(r_1 \ge r_1^* \mid B_2).$

What about the typical case that P(r = r*) = 0?

(g) Show that, if i is at least as good as i* in the sense that e_j ≤ e_j*
for both j’s, then i is a likelihood-ratio test and i is virtually i* in that
e_j = e_j* for both j’s. Hint: Consider an F and a β for which r*(β, β_0)
= r*, showing that these exist, and note that, for this decision problem,

(21)   $\begin{aligned}
E(f_{i^*(x)} \mid \beta) &= \{\epsilon_1 + \delta_2(1 - 2e_1^*)\}\beta(1) + \{\epsilon_2 + \delta_1(1 - 2e_2^*)\}\beta(2) \\
&= v(F(x) \mid \beta), \\
E(f_{i(x)} \mid \beta) &= \{\epsilon_1 + \delta_2(1 - 2e_1)\}\beta(1) + \{\epsilon_2 + \delta_1(1 - 2e_2)\}\beta(2) \\
&\ge v(F(x) \mid \beta),
\end{aligned}$

with equality if and only if i is a likelihood-ratio test.
This important conclusion about likelihood-ratio tests has been much
emphasized, especially by the Neyman-Pearson school.
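
The agreement between the two expressions for k(β) in part (b) of Exercise 2 can be checked numerically. The sketch below is a hedged illustration, not a solution from the text: it takes δ_1 = (φ_22 − φ_12)/2 and δ_2 = (φ_11 − φ_21)/2 as an assumed reading of (13), and the utilities in `phi` are hypothetical. Exact rational arithmetic is used so that equality can be tested directly.

```python
from fractions import Fraction as Fr

def k_max(beta, phi):
    """k(beta) as the larger of the two expected utilities E(f_i | beta)."""
    return max(sum(f * b for f, b in zip(row, beta)) for row in phi)

def deltas(phi):
    """delta_1 and delta_2 under the assumed reading of (13)."""
    d1 = (phi[1][1] - phi[0][1]) / 2
    d2 = (phi[0][0] - phi[1][0]) / 2
    return d1, d2

def k_closed(beta, phi):
    """k(beta) in the closed form given by the last line of (14)."""
    d1, d2 = deltas(phi)
    eps = [(phi[0][j] + phi[1][j]) / 2 for j in range(2)]
    return (eps[0] * beta[0] + eps[1] * beta[1]
            + abs(d1 * beta[1] - d2 * beta[0]))

def beta_0(phi):
    """The a priori distribution at which the two acts are equally good."""
    d1, d2 = deltas(phi)
    return (d1 / (d1 + d2), d2 / (d1 + d2))

# Hypothetical utilities: phi[i][j] stands for E(f_{i+1} | B_{j+1}).
phi = ((Fr(4), Fr(-1)), (Fr(0), Fr(2)))
grid = [(Fr(t, 20), 1 - Fr(t, 20)) for t in range(21)]
agree = all(k_max(b, phi) == k_closed(b, phi) for b in grid)
```

With these utilities `beta_0(phi)` is (3/7, 4/7), the point at which both acts have the same expected utility.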


The concept of likelihood ratio, sometimes simply called likelihood, 
is now one of the most pervasive concepts of statistical theory. It 
seems to have been introduced in 1922 by R. A. Fisher (cf. index of
[F3]), who emphasized it in connection with the important method of
estimation named by him “the method of maximum likelihood.” Its
use in testing hypotheses was apparently first emphasized by J. Neyman
and E. S. Pearson (see Vol. II, p. 303 of [K2]). In connection with
likelihood ratios as necessary and sufficient statistics, mathematically 
advanced readers will be interested in Section 6 of [L6], [B2], and 
[M5]. One of the earliest contributions in this direction was made by 
C. A. B. Smith [S14]. 


6 Repeated observations 


If x(n) = {x_1, ..., x_n}, where, given B_i, the x_r’s are independent
identically distributed random variables, then v(F(x(n))) is a
non-decreasing function of n, for the (n + 1)-tuple is an extension of the
n-tuple. If k(β) is strictly convex (a condition that you now recognize
as interesting), v(F(x(n))) is easily seen to be strictly increasing in n,
unless the individual x_r’s are either utterly irrelevant or definitive.

It is to be expected, especially in the light of the approach to certainty 
discussed in § 3.6, that, as n becomes very large, x(n) will become prac- 
tically definitive. Indeed, § 3.6 makes it possible to state and prove a 
formal theorem to that effect. 


THEOREM 1
HYP. 1. x(n) = {x_1, ..., x_n}, where, given B_i, the x_r’s are
independent and identically distributed random variables.

2. The x_r’s are not utterly irrelevant to B_i.

3. v(F | β) = k(β).
CONCL. $\lim_{n \to \infty} v(F(x(n)) \mid \beta) = l(\beta) = \beta(1)k(1, 0) + \beta(2)k(0, 1)$
uniformly in β.

PROOF. Writing x as short for x(n),

(1)    $v(F(x) \mid \beta) = E[k(\beta(x))].$

For an arbitrary ε > 0, let the closed interval I on which k is defined
be partitioned into two subsets J and K, where J is the set of those
β’s such that

(2)    $k(\beta) \ge l(\beta) - \epsilon,$

and K is the complement of J relative to I.

It follows from the continuity of the functions on each side of (2)
that β ∈ J, if either component of β is sufficiently large.

The computation initiated in (1) can now be carried forward thus:

(3)    $\begin{aligned}
E[k(\beta(x))] &= E[k(\beta(x)) \mid \beta(x) \in J]\,P(\beta(x) \in J) + E[k(\beta(x)) \mid \beta(x) \in K]\,P(\beta(x) \in K) \\
&\ge E[l(\beta(x)) \mid \beta(x) \in J]\,P(\beta(x) \in J) + \min_{\beta'} k(\beta') \cdot P(\beta(x) \in K) - \epsilon \\
&= E[l(\beta(x))] - \{E[l(\beta(x)) \mid \beta(x) \in K] - \min_{\beta'} k(\beta')\}\,P(\beta(x) \in K) - \epsilon \\
&\ge l(\beta) - 2\max_{\beta'} |k(\beta')| \cdot P(\beta(x) \in K) - \epsilon.
\end{aligned}$

Now, in view of the paragraph in which (3.6.15) occurs and the fact
that, if either component of β is close to 1, β ∈ J; P(β(x) ∈ K) becomes
arbitrarily small for sufficiently large n. ◆
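
The monotone approach of v(F(x(n)) | β) to l(β) can be watched in a small computation. What follows is a sketch under stated assumptions rather than anything given in the text: the x_r’s are taken to be 0/1 variables with success probability p_j under B_j, F is taken to consist of two acts with the hypothetical utilities `phi`, and the function name `derived_value` is invented for the illustration. The sum runs over the count of successes, which is sufficient here.

```python
from math import comb

def derived_value(n, beta, p, phi):
    """v(F(x(n)) | beta) for n independent 0/1 observations.

    beta[j]   : a priori probability of B_j
    p[j]      : probability that an observation equals 1, given B_j
    phi[i][j] : expected utility of act i given B_j
    """
    total = 0.0
    for y in range(n + 1):
        # joint[j] = beta(j) * P(y successes | B_j)
        joint = [b * comb(n, y) * pj ** y * (1 - pj) ** (n - y)
                 for b, pj in zip(beta, p)]
        # k(beta(y)) * P(y) = max over acts of sum_j phi_ij * joint_j
        total += max(sum(f * w for f, w in zip(row, joint)) for row in phi)
    return total

# Hypothetical problem: act i pays 1 exactly when B_i obtains.
beta = (0.5, 0.5)
p = (0.3, 0.6)
phi = ((1.0, 0.0), (0.0, 1.0))
values = [derived_value(n, beta, p, phi) for n in range(0, 201, 20)]
limit = beta[0] * phi[0][0] + beta[1] * phi[1][1]   # l(beta) for this phi
```

In this example `values` starts at k(β) = 0.5 with no observation and climbs toward `limit` = 1, as the theorem leads one to expect.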
7 Sequential probability ratio procedures


The present section digresses to discuss an interesting application of 
the ideas presented in this chapter to what is called sequential analysis. 
Sequential analysis refers in principle to the theory of observational pro- 
grams in which the selection of what observations to make in later 
phases of the program depends on what has been observed in earlier 
phases. Such behavior is commonplace in everyday life; for example, 
you look for something until you find it, but not longer. Statistics it- 
self has always used sequential procedures. For example, it is not rare 
to conduct a preliminary experiment to determine how a main experi- 
ment should be carried out. Thus, if one were required to estimate 
with a roughly preassigned precision the mean of a normal distribution 
of unknown mean and unknown variance, one might reasonably begin 
by taking ten or twenty observations, which would give some idea of 
the variance and would therefore determine about how many observa- 
tions are necessary for achieving the requisite precision. 

Commonplace though problems with sequential features are, A. Wald 
was the first to develop (1943) a systematic theory of a considerable 
body of problems of this sort. For early history see the Introduction 
of [W2] and the Foreword of Section I of [S17]. 

Some later ideas on sequential analysis, due mainly to Wald and 
Wolfowitz, are the subject of this section. It will not be practical to 
proceed with full rigor, primarily because random variables capable of 
assuming an infinite number of values are necessarily involved. Full 
details are given in [W3] and more compactly in [A7], but not in Wald’s 
book on sequential analysis [W2]. 

Let x = {x(1), ..., x(ν), ...}, where the x(ν)’s are conditionally an
infinite sequence of independent, relevant, identically distributed
random variables. Rather informally, a sequential observational program
with respect to x is a rule telling whether to observe x(1) or whether to
make no observation at all; if the particular value x(1) is observed,
whether to observe x(2) or to discontinue observation; if the values
x(1) and x(2) are observed whether to observe x(3) or to discontinue
observation, etc.

More formally, let N be a function of the infinite sequence of values
x = {x(1), ..., x(ν), ...} such that, if the sequence x' agrees with x in
every component from the first through the N(x)th, then N(x') = N(x).
Such a function N determines a sequential observational program,
which is a contraction of x, call it y(x; N), defined thus:

(1)    $y(x; N) =_{\mathrm{Df}} \{x(1), \ldots, x(N(x))\}.$

It is to be understood that, if N(x) is zero for some x, it is identically
zero, and that y(x; 0) is a null observation.
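
The requirement placed on N amounts to saying that the decision to stop may depend only on what has already been observed. A minimal sketch of that idea in Python follows; the helper `program` and the rule `three_ones` are hypothetical names chosen for the illustration, and a finite list stands in for the infinite sequence x.

```python
def program(stop, xs):
    """Return y(x; N) = (x(1), ..., x(N(x))) for the rule `stop`.

    `stop(prefix)` sees only the values observed so far and returns True
    when observation is to be discontinued; xs supplies x(1), x(2), ...
    """
    observed = []
    source = iter(xs)
    while not stop(tuple(observed)):
        observed.append(next(source))
    return tuple(observed)

def three_ones(prefix):
    """Hypothetical rule: discontinue once three 1's have been observed."""
    return sum(prefix) >= 3

x = [0, 1, 1, 0, 1, 0, 0, 1]
x_prime = [0, 1, 1, 0, 1, 1, 1, 1]    # agrees with x through the 5th component
y = program(three_ones, x)
y_prime = program(three_ones, x_prime)
null = program(lambda prefix: True, x)  # N identically zero: null observation
```

Because `stop` is consulted before each observation and is shown only the prefix, two sequences that agree through the N(x)th component necessarily yield the same N, which is the defining property of the text. If the finite list were exhausted before the rule said to stop, `next` would raise `StopIteration`; an infinite sequence has no such limitation.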

It will be assumed that the random cost associated with a sequential
observational program is proportional to the number of random variables
observed, that is, c = N(x)γ, γ > 0. No categorical defense of
this assumption is suggested, but clearly there are interesting problems
in which it is met at least approximately. The domain of applicability
of the theory can actually be considerably extended by modifying the
assumption to include a fixed overhead cost that applies except in case
N is identically zero; this does not greatly complicate the analysis, as
the interested reader will be able to see for himself. The theory would
even remain virtually unchanged, if c were only assumed to be of the
form

(2)    $c = \begin{cases} h + \sum_{\nu=1}^{N(x)} c(\nu), & \text{if } N > 0, \\ 0, & \text{if } N = 0, \end{cases}$

where h, c(1), c(2), ... are independent with finite expected values
E(h) > 0, E(c(ν)) > 0, and the c(ν)’s are identically distributed.

For any F there are some values of β for which it would be unwise to
adopt any sequential observational program other than the null
observation. Suppose, for example, that β is so close to an extreme value
that l(β) − k(β) < γ; under this circumstance the most that could be
gained by observing even x itself would be less than γ, but the cost of
making so much as one observation is at least γ. Let the set of values
of β for which it is not justified to make any but the null observation be
denoted for a while by J(F; γ), or simply J, for short.

Now, if β ∈ J, the person’s utility can, by the definition of J, be
maximized by refraining from any observation but the null observation and
accepting the utility k(β); otherwise there will be some advantage to
him in observing x(1). If the person does observe the particular value
x(1) of x(1), he finds himself with a posteriori probabilities β(x(1)) in
place of the a priori β, he has paid (or at any rate entailed) a cost γ,
and he must now decide whether to make any further observations.
His new problem is simply the problem he would have faced at the
outset had his a priori probabilities been β(x(1)) instead of β, except that
all utilities are now reduced by γ. He justifiably accepts the utility
k(β(x(1))) − γ, if β(x(1)) ∈ J; otherwise he will observe x(2).
Continuing this line of argument step after step, it follows that optimal action
consists in observing successive x(ν)’s until an a posteriori probability
in J occurs, and then adopting a basic act consistent with the a posteriori
probability.
In actual practice, it is far from easy to determine whether a
particular value of β belongs to J(F; γ), because in principle the whole enormous
variety of sequential observational programs has to be explored to
determine whether any one of them has a derived value greater than k(β).
The practical advantage achieved in the preceding paragraph is that
of greatly restricting the class of programs that merit consideration.
Thus the problem of determining whether β ∈ J(F; γ) does not require
a survey of all observational programs, but only of those defined in
terms of some set J' according to the rule that N(x) is the first integer n
for which β(x(1), ..., x(n)) ∈ J'.

If programs corresponding to all sets J' had to be examined, the
process would still be mathematically impractical; indeed, in all but
special cases, practical solutions have yet to be found. But, if any
special conditions that J must necessarily satisfy are discovered, only
sets J' satisfying those conditions need be examined. Some very
general conditions are these: J contains the extreme points of I; J is
topologically closed, that is, if a value β_0 is not in J, then the near neighbors
of β_0 are also not in J. The first of these conditions requires no
comment, and the second follows easily from the continuity as a function of
β of

(3)    $E[k(\beta(y(x; N))) - \gamma N \mid \beta] - k(\beta).$


These conditions alone do not go far toward narrowing to practical 
limits the variety of sets to be explored. Thus far in the development 
of the subject, really powerful conditions have been obtained only at 
the expense of considerable restrictions on the structure of F or, equiv- 
alently, of k. 

Suppose, then, that F is dominated by a finite number of acts or,
what amounts to a little less, that the graph of k is polygonal, as it is
for the k graphed in Figure 2.1. Technically, this restriction on k may
be expressed by saying that the interval I is the union of a finite
number of intervals of linearity of k. Under the restriction, relatively much
can be concluded about the structure of J(F; γ), for it is true in general,
as will be shown in the next paragraph, that the intersection of J with
any interval of linearity of k is a closed interval.

Suppose, indeed, that β_1 and β_2 belong to J and to a common interval
of linearity of k, but that β_0 on the interval between β_1 and β_2 does not
belong to J. A contradiction follows according to the following
computation, in which h is any act derived from a sequential observational
program, cost included, that is advantageous at β_0.

(4)    $\sum_j E(h \mid B_j) \beta_0(j) > k(\beta_0),$

for h is supposed to be advantageous at β_0; and

(5)    $\sum_j E(h \mid B_j) \beta_m(j) \le k(\beta_m), \qquad m = 1, 2,$

for no derived act is supposed to be advantageous at β_m, since β_m ∈ J.
Since β_0 is a weighted average, say Σγ_mβ_m, of the β_m’s, and since k(β) is
linear in the interval between β_1 and β_2, it follows from (4) and (5) that

(6)    $\sum_i E(h \mid B_i) \beta_0(i) \le k(\beta_0),$

contradicting (4). The supposition that β_0 ∈ ~J has thus been
reduced to absurdity.

The demonstration just given extends directly to n-fold problems.
The general conclusion is that the intersection of J with any domain 
of linearity of k is convex, so that, if k is polyhedral, J is the union of a 
finite number of closed convex sets, each lying wholly in a domain of 
linearity of k. The practical implications of the conclusion are enor- 
mously greater for twofold than for higher-fold problems, because 
twofold problems lead to one-dimensional bounded, closed, convex 
sets, which present no great variety, all of them being closed bounded 
intervals. But threefold problems, for example, lead to closed bounded 
two-dimensional convex sets, a restriction that leaves great room for 
variety. 

If k is polygonal, the variety of sets J’ to be surveyed is enormously 
reduced, for J’ must be the union of a known number of intervals, each 
of which is confined to a known interval. Suppose that this number is 
m; the class of sequential observational programs to be surveyed can 
be characterized by the two end points of each of the m intervals, ex- 
cept that the possibility that some of the intervals are vacuous must be 
borne in mind. Since the extremes of I are necessarily in J, and
therefore necessarily appear as end points of intervals in J, the exploration
has been reduced to a 2(m − 1) parameter family of possibilities.

The possibility that m = 1, which almost means that F is dominated
by a single element of itself, is trivial; for then all β’s are in J, and
observation is never called for. This can be seen in many ways. In
particular, it follows as an illustration of the machinery that has just been
developed, thus: The end points, or extremes, of I are both in J, as
always, and, since m = 1, they are both in the same interval of linearity
of k; therefore the interval between them, namely every value of β,
lies in J.

The possibility that m = 2 (in ordinary statistical usage, the
sequential testing of a simple dichotomy) is of particular importance.
It occurs typically when F is dominated by two acts, neither of which
dominates the other, as in Exercise 5.2. One of the two acts is
appropriate to one “hypothesis” B_1, and the other is appropriate to B_2. In
case m = 2, it is easily seen, by methods that have now been indicated
more than once, that each of the two closed intervals that constitute J
has as one end point one of the extremes of I. Neither of the two
intervals can be vacuous, nor can either consist only of a single point. It is
relatively easy to find, at least approximately, the two values of β that
determine J(F; γ), and the theory of this situation has correspondingly
been brought to a relatively high degree of perfection; for details, see
[S17], [W2], [W3], and [A7].

Following (or at least paraphrasing) Wald [W2], a sequential
observational program characterized by making successive observations
until the a posteriori probabilities fall into some set J, followed by
adopting a basic act appropriate to the a posteriori probability, is called a
sequential probability ratio procedure. The reason for this nomenclature
is that to observe until the a posteriori probabilities fall into J is
to observe until the numbers

(7)    $\beta(i \mid x(1), \ldots, x(\nu)) = \dfrac{\beta(i) P(x(1), \ldots, x(\nu) \mid B_i)}{\sum_j \beta(j) P(x(1), \ldots, x(\nu) \mid B_j)}$

lie in a certain set, or, what amounts to the same thing, satisfy certain
conditions. But, the particular value of β having been assigned, this
is tantamount to requiring the ratios of probabilities

(8)    $\dfrac{P(x(1), \ldots, x(N) \mid B_1)}{P(x(1), \ldots, x(N) \mid B_2)}$

to satisfy certain conditions.

Since (7) and (8) are ways of expressing the likelihood ratio, the ob- 
servational program together with the act derived from it might also 
be referred to as a sequential likelihood-ratio procedure. Indeed, but 
for the precedent established by Wald, that would seem the better 
name. 

As an actual example of a sequential probability ratio procedure,
suppose that the distribution of x(ν) given B_i attaches the probabilities
p_i and q_i = 1 − p_i to the values 1 and 0, respectively. The expression
(8) can in any case be written in the factored form

(9)    $\prod_{\nu=1}^{N} \dfrac{P(x(\nu) \mid B_1)}{P(x(\nu) \mid B_2)},$




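and in the present example this takes the special form

(10)   $\left(\dfrac{p_1}{p_2}\right)^{y(N)} \left(\dfrac{q_1}{q_2}\right)^{N - y(N)} = \left(\dfrac{q_1}{q_2}\right)^{N} \left(\dfrac{p_1 q_2}{p_2 q_1}\right)^{y(N)},$

where

(11)   $y(N) = \sum_{\nu=1}^{N} x(\nu).$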


It is noteworthy, in connection with sufficient statistics, that the
condition that the a posteriori probability be in J is in this case expressible,
according to (10), as a condition on y(N) and N. Specializing the
example further, suppose that J is of the sort appropriate to testing a
simple dichotomy. The condition that the a posteriori probability be
in ~J is then expressed by each of the following equivalent pairs of
inequalities, where a(1) and a(2) are positive numbers such that a(1)
+ a(2) < 1.

(12)   $\beta(1 \mid x(1), \ldots, x(N)) < 1 - a(1), \qquad \beta(2 \mid x(1), \ldots, x(N)) < 1 - a(2).$

(13)   $\dfrac{\beta(1) Q}{\beta(1) Q + \beta(2)} < 1 - a(1), \qquad \dfrac{\beta(2)}{\beta(1) Q + \beta(2)} < 1 - a(2),$

where Q for the moment denotes the likelihood ratio (10).

(14)   $Q < \dfrac{\beta(2)(1 - a(1))}{\beta(1) a(1)} = Q^*, \qquad Q > \dfrac{\beta(2) a(2)}{\beta(1)(1 - a(2))} = Q_*,$

where Q*, Q_* are defined by the context. Since, according to (13), the
structure of ~J is superficially determined by three parameters, say by
β(1), a(1), and a(2), it is worthy of some note that the corresponding
condition is ultimately expressed in terms of only two special parameters,
Q* and Q_*; this is only natural, considering that ~J is an open interval
determined by its two end points. The act that would be appropriate
to B_1 is called for by values of Q ≥ Q*, and the one appropriate to B_2
is called for by values of Q ≤ Q_*.
Thus far, the particular form (10) of the likelihood ratio has not
really been exploited in the calculation, so (14) applies to the testing of
simple dichotomies generally. Taking account of (10), (14) can by
elementary manipulation be put in the following form.

(15)   $\begin{aligned}
y(N) &< \{\log Q^* + N \log (q_2/q_1)\} / \log (p_1 q_2 / p_2 q_1), \\
y(N) &> \{\log Q_* + N \log (q_2/q_1)\} / \log (p_1 q_2 / p_2 q_1),
\end{aligned}$

where, for definiteness, it is supposed that p_1 > p_2. Thus, the region
in the (N, y) plane determined by ~J, the region in which further
observations are called for, is a band bounded by two parallel lines of
positive slope.
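
The band described by (15) lends itself to a direct computation. The following Python sketch is an illustration under assumptions, not Wald's or the author's procedure as published: the names `thresholds` and `sprt` are hypothetical, the thresholds follow the reading of (14) as Q* = β(2)(1 − a(1))/β(1)a(1) and Q_* = β(2)a(2)/β(1)(1 − a(2)), and p_1 > p_2 is supposed, as in the text.

```python
import math

def thresholds(beta, a):
    """Return (Q^*, Q_*) as read from (14)."""
    b1, b2 = beta
    a1, a2 = a
    return b2 * (1 - a1) / (b1 * a1), b2 * a2 / (b1 * (1 - a2))

def sprt(xs, p, beta, a):
    """Observe 0/1 values until (N, y(N)) leaves the band (15).

    Returns (i, N): the index i of the B_i whose appropriate act is
    called for, and the number N of observations taken.
    """
    p1, p2 = p                      # requires p1 > p2
    q1, q2 = 1 - p1, 1 - p2
    q_upper, q_lower = thresholds(beta, a)
    scale = math.log(p1 * q2 / (p2 * q1))
    drift = math.log(q2 / q1)
    y, n = 0, 0
    source = iter(xs)
    while True:
        upper_line = (math.log(q_upper) + n * drift) / scale
        lower_line = (math.log(q_lower) + n * drift) / scale
        if y >= upper_line:         # equivalent to Q >= Q^*
            return 1, n
        if y <= lower_line:         # equivalent to Q <= Q_*
            return 2, n
        y += next(source)
        n += 1

# Illustrative values (assumptions, not from the text).
p, beta, a = (0.6, 0.3), (0.5, 0.5), (0.05, 0.05)
after_ones = sprt([1] * 20, p, beta, a)
after_zeros = sprt([0] * 20, p, beta, a)
```

With these values Q* = 19 and Q_* = 1/19. An unbroken run of 1's doubles Q at each step, so the upper line is first reached at N = 5; an unbroken run of 0's multiplies Q by 4/7 at each step, so the lower line is first reached at N = 6.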


8 Standard form, and absolute comparison between observations 


If x and y are such that, for every F and β, v(F(x) | β) ≥ v(F(y) | β);
then x imitates, so to speak, an extension of y, and it may appropriately 
be said that x is a virtual extension of y. Correspondingly, if x is a vir- 
tual extension of y, and y is a virtual extension of x, it may be said that 
x and y are virtually equivalent. 

No matter what a priori probabilities a person may have, or what 
basic acts are available to him, he will have no preference between a 
pair of virtually equivalent observations, so virtually equivalent obser- 
vations are indeed equivalent for many practical purposes. Where com- 
binations of observations are under consideration, however, the rela- 
tion of virtual equivalence does not resemble true equivalence. For 
example, if x and y are equivalent, then each is equivalent to the mul- 
tiple observation {x, y}, but if x and y are only virtually equivalent, 
they may well be independent, in which case neither will typically be 
equivalent to {x, y}. 

This section explores the notions of virtual extension and virtual 
equivalence. In particular, an interesting standard representative of 
the class of observations virtually equivalent to a given observation x 
is defined and discussed. This material is scarcely referred to later in 
the book, and it may without much loss be skipped or glossed over. It 
will be couched frankly in the language of n-fold as opposed to twofold 
partitions, but readers with the rest of the chapter behind them will 
easily be able to concentrate on the twofold situation, if they find it 
more understandable. 

Most of the ideas to be presented in this section were originated by
H. F. Bohnenblust, L. S. Shapley, and S. Sherman in a private
memorandum dated August 1949, which I was privileged to see at that time.
This work was extended and brought to the attention of the public by
David Blackwell in [B16].

It is obvious that, if y is a sufficient statistic for x, then x and y are
virtually equivalent. In particular the likelihood ratio r derived from
x is virtually equivalent to x. Moreover, the reader may anticipate, and
it will be formally shown in the course of this section, that if and only
if observations are virtually equivalent do their likelihood ratios have
the same distribution for every value of β, or, what comes to the same
thing, given each B_i, i = 1, ..., n. Thus the n conditional
distributions of the likelihood ratio given each B_i could be taken to characterize
the observations virtually equivalent to a given one, say x. Actually,
as will be shown, the class of observations virtually equivalent to x can
be represented by the distribution of the likelihood ratio for any single
non-extreme value of β. For definiteness, the particular value β* =
{1/n, ..., 1/n} will be used, but the interested reader will find it a
simple exercise to extend all the considerations based on β* to any
other non-extreme β, as would be necessary in any extension of the theory
to infinite partitions.

Let m(r) be the probability that the likelihood ratio in the standard
form (5.5) attains the particular value r when β = β*. With
self-evident abbreviations,

(1)    $\begin{aligned}
m(r) &= P(r \mid \beta^*) \\
&= \sum_i P(r \mid B_i)(1/n) \\
&= \frac{1}{n} \sum_j \sum_{r(x) = r} P(x \mid B_j).
\end{aligned}$

The second line of (1) exhibits m(r) expressed in terms of the n
distributions P(r | B_i). It is rather more interesting to see that those n
distributions can themselves all be expressed in terms of the single
distribution m, as follows from the definition (5.5) of r and the third line
of (1) thus:

(2)    $\begin{aligned}
P(r \mid B_i) &= \sum_{r(x) = r} P(x \mid B_i) \\
&= \sum_{r(x) = r} r_i(x) \sum_j P(x \mid B_j) \\
&= n r_i m(r).
\end{aligned}$

Similarly,

(3)    $P(r \mid \beta) = n \Big\{ \sum_j r_j \beta(j) \Big\} m(r).$
Regarded as a probability measure on the set of all n-tuples of
numbers r, m has the following three important properties.

(4)    $P(r_i \ge 0 \mid m) = 1; \qquad P\Big(\sum_i r_i = 1 \,\Big|\, m\Big) = 1; \qquad E(r_i \mid m) = n^{-1}.$

Of these, the first two are obvious from the definition of r, and the third
follows by calculation from (2) thus:

(5)    $1 = \sum_r P(r \mid B_i) = n \sum_r r_i m(r) = n E(r_i \mid m).$

Conversely, suppose that m is any mathematical probability defined
on the set of n-tuples r of numbers, subject to the conditions (4), then,
as can easily be verified, n mathematical probabilities are formally
defined by the equation P(r | B_i) = nr_im(r). Mathematically, r
distributed thus can be regarded as an observation. The following
calculation demonstrates the expected conclusion that the likelihood ratio
of this observation is the observation itself and that its distribution
given β* is m.

(6)    $\dfrac{P(r \mid B_i)}{\sum_j P(r \mid B_j)} = \dfrac{n r_i m(r)}{n \sum_j r_j m(r)} = r_i; \qquad P(r \mid \beta^*) = \sum_j n r_j m(r)(1/n) = m(r).$
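
For an observation with finitely many values the passage to the standard form is a short computation. The sketch below is a hedged illustration: the function name `standard_form` and the threefold table of probabilities are hypothetical, and exact fractions are used so that the properties (4) and the relation P(r | B_i) = nr_im(r) can be checked without rounding.

```python
from collections import defaultdict
from fractions import Fraction as Fr

def standard_form(table):
    """Return m as a dict {r: m(r)}.

    table[x][i] = P(x | B_i); each column is assumed to sum to 1.
    """
    n = len(next(iter(table.values())))
    m = defaultdict(Fr)
    for probs in table.values():
        total = sum(probs)
        if total == 0:
            continue                          # x is impossible under every B_i
        r = tuple(p / total for p in probs)   # likelihood ratio of x
        m[r] += total / n                     # third line of (1)
    return dict(m)

# Hypothetical threefold problem with four possible values of x.
table = {
    "a": (Fr(1, 2), Fr(1, 4), Fr(0)),
    "b": (Fr(1, 4), Fr(1, 8), Fr(0)),
    "c": (Fr(1, 4), Fr(1, 8), Fr(1, 2)),
    "d": (Fr(0), Fr(1, 2), Fr(1, 2)),
}
m = standard_form(table)
mean_r = [sum(r[i] * w for r, w in m.items()) for i in range(3)]
```

The values "a" and "b" have the same likelihood ratio (2/3, 1/3, 0) and are therefore merged, so m is carried by three points; each entry of `mean_r` equals 1/3, in agreement with the third property in (4).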


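It is interesting and fruitful to compute v(F(x) | β) in terms of m.

(7)    $\begin{aligned}
v(F(x) \mid \beta) &= E(k(\beta(x)) \mid \beta) \\
&= E\Big[k\Big(\Big\{r_i \beta(i) \Big/ \sum_j r_j \beta(j)\Big\}\Big) \,\Big|\, \beta\Big] \\
&= n E\Big[k\Big(\Big\{r_i \beta(i) \Big/ \sum_j r_j \beta(j)\Big\}\Big) \sum_j r_j \beta(j) \,\Big|\, m\Big].
\end{aligned}$

Temporarily adopt the convention that, if a is any n-tuple of positive
numbers and h any function of r (not necessarily convex), T(a)h is a
function of r defined thus:

(8)    $T(a)h(r) =_{\mathrm{Df}} h\Big(\Big\{r_i a(i) \Big/ \sum_j r_j a(j)\Big\}\Big) \sum_j r_j a(j).$

Then (7) takes the abbreviated form

(9)    $E(k(\beta(x)) \mid \beta) = n E(T(\beta)k(r) \mid m).$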
To see the implications of (9), it is necessary to know something about
what the operation T(β) does to the function k, in particular to know
that T(β)k is convex in r. The derivation of these necessary facts is
straightforward and is left to the reader as a sequence of exercises.

Exercises

1a. T(a)T(β)h = T({a(1)β(1), ..., a(n)β(n)})h = T(β)T(a)h.

1b. h = T({a(1)^{-1}, ..., a(n)^{-1}})T(a)h.

2. T(β*)h = (1/n)h.

3. If h(r) ≥ g(r) for r between r' and r''; then T(a)h(r) ≥ T(a)g(r)
for r between {r_i'a(i)/Σ_j r_j'a(j)} and {r_i''a(i)/Σ_j r_j''a(j)}.

4. If h is linear, then so is T(a)h.

5. If h is convex (strictly convex), then so is T(a)h.

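Exercise 5 is obvious in the light of Exercises 3 and 4, but some may
prefer the demonstration suggested by the following calculation, where
λ + μ = 1; λ, μ ≥ 0; and obvious abbreviations are used.

(10)   $\begin{aligned}
T(a)h(\lambda r + \mu r') &= h\left(\frac{\lambda\, a \cdot r}{a \cdot (\lambda r + \mu r')} \cdot \frac{a r}{a \cdot r} + \frac{\mu\, a \cdot r'}{a \cdot (\lambda r + \mu r')} \cdot \frac{a r'}{a \cdot r'}\right) a \cdot (\lambda r + \mu r') \\
&\le \lambda h\left(\frac{a r}{a \cdot r}\right) a \cdot r + \mu h\left(\frac{a r'}{a \cdot r'}\right) a \cdot r' \\
&= \lambda T(a)h(r) + \mu T(a)h(r').
\end{aligned}$

It is amusing to establish once more that observation generally pays,
this time by means of (10), (4), and Exercises 5 and 2.

(11)   $\begin{aligned}
n E(T(\beta)k(r) \mid m) &\ge n T(\beta)k(E(r \mid m)) \\
&= n T(\beta)k(\beta^*) \\
&= k(\beta).
\end{aligned}$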

If x and x’ are observations and m and m’ are the corresponding dis- 
tributions, it is now easy to say in terms of m and m’ when x is utterly 
irrelevant, when it is definitive, and when x is virtually an extension of x’. 




More exercises 

6. The observation x is utterly irrelevant if and only if P(r = β* | m)
= 1.

7. The observation x is definitive, if and only if P(r_i = 1 | m) = 1/n,
or, equivalently, if and only if P(r_i = 0 | m) = (n − 1)/n.




8a. The observation x is a virtual extension of x', if and only if, for
every convex function h defined for r,

(12)   $E(h(r) \mid m) \ge E(h(r) \mid m').$

8b. The two observations are virtually equivalent, if and only if, for
every convex function+ h,

(13)   $E(h(r) \mid m) = E(h(r) \mid m').$


The conclusion reached in Exercise 8b can be much improved. In- 
deed, it will be shown that the two observations are virtually equiva- 
lent, if and only if m and m’ are the same probability measures. This 
will be achieved if, for example, it is shown that m and m’ have the 
same moments, for it is well known that two different countably addi- 
tive probability measures confined to a bounded set of n-tuples of num- 
bers cannot have the same moments.t The moments in question are 
expected values of monomials of the form 


(14) g(r) = ry%rg? +++ ry, 


where the ¢,;’s are non-negative integers. In general, g will not be 
convex, so it cannot be concluded immediately that g has the same 
expected value with respect to m and m’. If, however, a highly convex 
function is added to g, then the sum will be convex and its expected 
value will be the same with respect to m and m’. Since, by hypothesis, 
this is also true of the convex term of the sum, it must also be true of 
the not necessarily convex term. Specifically, let 


(15) h(r) = g(r) + λ Σ r_i²,


where λ is a positive number to be determined later. To test h for
convexity, let s be for the moment an arbitrary n-tuple of numbers and σ
a real variable, and compute the second derivative of h(r + σs) with
respect to σ at σ = 0.


(16) d²h(r + σs)/dσ² = Σ_{i,j} [∂²g(r)/∂r_i ∂r_j] s_i s_j + 2λ Σ_i s_i².


Considering that each r_i is between 0 and 1, the absolute values of the
derivatives of g that appear in (16) have a common upper bound, say
μ; so, if λ > μn², h is convex in the region where each r_i lies between 0
and 1 and is a fortiori convex in the intersection of that region with
the hyperplane Σr_i = 1.


† See, for example, Corollary 1.1, p. 11, of [S13].

Under our usual simplifying assumption that x is confined to a finite number of
values, m is certainly countably additive. Actually, the whole theory can be
developed mutatis mutandis assuming only that the distribution of x is countably
additive on some suitable Borel field.

+ Morse and Sacksteder (1966) show, in effect, that the test can be confined
to the very special convex functions max p_i r_i, where the p_i are arbitrary
positive numbers.

Now that it has been established that m and m’ represent virtually 
equivalent observations, if and only if m and m’ are identical, it is ap- 
parent that m—or, more exactly, the set of conditional distributions 
P(r | B_i) = n r_i m(r)—is a unique standard form for all observations
virtually equivalent to x. 

If x virtually extends y, it is to be expected that, no matter what
reasonable definition of “informative” may be suggested, x will be at least
as informative as y. In particular, it is to be expected that the
information of B_i with respect to B_j (as defined in § 3.6) will be at least as
large for x as for y, which the following calculation verifies, supposing 
for simplicity that, for both observations, infinite information is im- 
possible. The point in question depends on the convexity of the func- 
tion h defined by 


(17) h(r) = r_i(log r_i − log r_j),

because

(18) I_{i,j} = E(log r_i − log r_j | B_i)
 = nE[r_i(log r_i − log r_j) | m].
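The identity (18) can be checked numerically on a toy observation. The sketch below is only an illustration under stated assumptions: the conditional distributions in `P` are hypothetical, and r_i(x) is taken to be the a posteriori probability of B_i given x when each B_i has a priori probability 1/n, which is how the relation P(r | B_i) = n r_i m(r) is read here. The names `info_direct` and `info_via_m` are invented for the example.

```python
import math

# Hypothetical conditional distributions P(x | B_i) over three values of x
# for a twofold partition (n = 2); every entry is positive, so infinite
# information is impossible.
P = [[0.5, 0.3, 0.2],   # P(x | B_1)
     [0.1, 0.3, 0.6]]   # P(x | B_2)
n = len(P)
xs = range(len(P[0]))

# Distribution of x when each B_i has a priori probability 1/n (assumption).
p = [sum(P[i][x] for i in range(n)) / n for x in xs]

# r_i(x): a posteriori probability of B_i given x under that uniform prior.
r = [[P[i][x] / (n * p[x]) for x in xs] for i in range(n)]


def info_direct(i, j):
    """E(log r_i - log r_j | B_i), the first expression in (18)."""
    return sum(P[i][x] * (math.log(r[i][x]) - math.log(r[j][x])) for x in xs)


def info_via_m(i, j):
    """n E[r_i (log r_i - log r_j) | m], the second expression in (18).

    The expectation with respect to m is computed by summing over x
    with the weights p(x).
    """
    return n * sum(
        p[x] * r[i][x] * (math.log(r[i][x]) - math.log(r[j][x])) for x in xs
    )


print(info_direct(0, 1), info_via_m(0, 1))
```

The two printed numbers agree up to rounding, because P(x | B_i) = n r_i(x) p(x) term by term.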


The required convexity can be demonstrated much as it was in (15)†
for a different function also momentarily called h:


(19) d²h(r + σs)/dσ² = [∂²h(r)/∂r_i²] s_i² + 2[∂²h(r)/∂r_i ∂r_j] s_i s_j + [∂²h(r)/∂r_j²] s_j²
 = s_i²/r_i − 2s_i s_j/r_j + r_i s_j²/r_j².
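That the last expression is non-negative is perhaps most easily seen by completing the square. The following line is a supplementary step, not part of the original display, and assumes only that r_i and r_j are positive:

```latex
\frac{s_i^{2}}{r_i} - \frac{2 s_i s_j}{r_j} + \frac{r_i s_j^{2}}{r_j^{2}}
  = r_i \left( \frac{s_i}{r_i} - \frac{s_j}{r_j} \right)^{2} \ge 0
  \qquad (r_i > 0,\ r_j > 0).
```

Since the second derivative along every direction s is therefore non-negative, the function r_i(log r_i − log r_j) is convex wherever both coordinates are positive.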


It would be interesting to know whether every virtual extension is 
realized by an actual extension, that is, whether whenever x is a vir- 
tual extension of y there exist random variables x’ and y’ such that x 
and x’ are virtually equivalent, y and y’ are virtually equivalent, and 
x’ extends y’. To the best of my knowledge that conclusion has thus 
far been established only in the case of twofold problems, the demon- 
stration for that case being given by Blackwell in [B16]. 


† Actually, this calculation depends only on the convexity of (log r_i −
log r_j) in r_j/r_i.


CHAPTER 8 


Statistics Proper 


1 Introduction 


I think any professional statistician, whether or not he found himself 
in sympathy with the preceding chapters, would feel that, even allow- 
ing for the abstractness expected in a book on foundations, those chap- 
ters do not really discuss his profession. He would not, I hope, find the 
same shortcoming in this and the succeeding chapters, for they are con- 
cerned with what seems to me to be statistics proper. The purpose of 
the present short chapter is to explain this transition and to serve as a 
general introduction to its successors. 


2 What is statistics proper? 


So far as I can see, the feature peculiar to modern statistical activity 
is its effort to combat two inadequacies of the theory of decision, as I 
have thus far discussed it. In the first place, there are the vagueness 
difficulties associated with what in § 4.2 were called “unsure probabilities.”
Second, there are the special problems that arise from more than
one person’s participating in a decision. 

From the personalistic point of view, statistics proper can perhaps be 
defined as the art of dealing with vagueness and with interpersonal 
difference in decision situations. Whether this very tentative definition
is justified, later sections and chapters will permit the statistical
reader to judge. At any rate, vagueness and interpersonal difference 
are the concepts that, directly or indirectly, dominate the rest of this 
book. 

I will not try to discuss vagueness in this chapter, but something 
may profitably be said here about interpersonal differences. 


3 Multipersonal problems 


As I have already frequently said, it seems to me that multipersonal 
considerations constitute much of the essence of what is ordinarily 
called statistics, and that it is largely through such considerations that 
the achievements of the British-American School can be interpreted in 
terms of personal probability. This is a view that can best be defended
by illustration, and the requisite illustrations will be scattered through- 
out later chapters; but some support is lent to it by those critics of 
personal probability who say that personal probability is inadequate 
because it applies only to individual people, whereas the methods of 
science are, more or less by definition, those methods that are accepta- 
ble to all rational people. 

The sort of multipersonal problems I mean to call attention to are 
those arising out of differences of taste and judgment, as opposed to 
those, so familiar in economics, arising out of conflicting interests. As a 
matter of fact, the latter type of multipersonal situation can, if one 
chooses, be regarded as among the former; it may, for example, be 
said that you and I have different tastes for the process of taking a dol- 
lar from me and giving it to you. 

Though modern statisticians do not at all deny the existence of dif- 
ferent tastes in different people, only occasionally do they take that 
difference explicitly into account. In particular, the theory of utility 
has scarcely ever entered explicitly into the works of statisticians. Our 
intellectual ancestors who believed in the principles of mathematical 
expectation were less tolerant than modern statisticians in so far as 
they denied rationality in those whose tastes departed from that prin- 
ciple, and some of their bigotry is occasionally met with today. 

In dealing with multipersonal situations, it is clearly valuable to 
recognize those in which the people involved may all reasonably be 
expected to have the same tastes, that is, utilities, with respect to the
alternatives involved in the situation. Explicit attempts to discover 
general circumstances under which people’s tastes will be identical are 
rare. The most important and fruitful attempt of this sort is repre- 
sented by D. Bernoulli’s idea that utility functions will typically be 
approximately linear within sufficiently confined ranges of income.
Consciously or unconsciously, that principle is repeatedly appealed to 
throughout statistics; it was, for example, brought out in § 6.5 that the 
very idea of an observation depends for its practical value on Bernoulli’s 
principle of approximate linearity. 

Relatively inexplicit exploitations of similarity of taste are sometimes 
made in statistics. The idea is often expressed, for example, that the 
penalty for making an estimate discrepant from the number to be esti- 
mated will, for everyone concerned, be proportional (within a reason- 
able range) to the square of the discrepancy; an argument for this prin- 
ciple as a rule of thumb appropriate to many contexts will be given in 
§ 15.5. Again, there are situations in which it is agreed that the pen- 
alty will depend only on the discrepancy and not on the true value of 
the number to be estimated. Of course, there are problems in which
both rules are invoked simultaneously, the penalty being supposed to 
be proportional to the square of the discrepancy and independent of 
the value to be estimated. 

Turn now to differences in judgment, that is, to differences in the 
personal probability, for different people, of the same event. Though 
modern objectivistic statisticians may recognize the existence of dif- 
ferences of judgment, they argue in theoretical discussions that statis- 
tics must be pursued without reference to the existence of those differ- 
ences, indeed without reference to judgment at all, in order that con- 
clusions shall have scientific, or general, validity. To put the same 
idea in personalistic terms, I would say that statistics is largely devoted 
to exploiting similarities in the judgments of certain classes of people 
and in seeking devices, notably relevant observation, that tend to min- 
imize their differences. 

The tendency of observation to bring about agreement has been il- 
lustrated in § 3.6. Some of the other general circumstances in which 
different people may be expected to agree, or at least nearly agree, in 
some of their judgments have also been mentioned. For example, it 
may well happen that different people are faced with partition prob- 
lems that are the same in that the same variable is to be observed by 
each person, but differ in that each person has his own a priori probabilities
β and his own set of available acts F. If, however, the conditional
distribution of x given B_i is the same for each person, then the
people will, for example, agree as to whether a contraction y of x is 
sufficient, which is often of great practical value. Again, there are cir- 
cumstances under which each of these same people will agree that cer- 
tain derived acts are nearly optimal. 


4 The minimax theory 


In recent years there has been developed a theory of decision, here 
with due precedent to be called the minimax theory, that embraces so 
much of current statistical theory that the remaining chapters can 
largely be built around it. The minimax theory was originated and 
much developed by A. Wald, whose work on it is almost completely 
summarized in his book [W3]. Wald’s minimax theory, of course, derives
from, and reflects, the body of statistical theory that had been
developed by others, particularly the ideas associated with the names of
J. Neyman and E. S. Pearson. It seems likely that, in the development
of the minimax theory, Wald owed much to von Neumann’s treatment
of what von Neumann calls zero-sum two-person games, which, though
conceptually remote from statistics, is mathematically all but identical
with study of the minimax rule, the characteristic feature of the minimax
theory.

Wald in his publications, and even in conversation, held himself 
aloof from extramathematical questions of the foundations of statistics; 
and therefore many of the opinions expressed in later chapters on such
points in connection with the minimax theory were neither supported
nor opposed by him. It may fairly be said, however, that he was an
objectivist and that his work was strongly motivated by objectivistic 
ideas. 

My policy here of holding difficulties of mathematical technique to a 
minimum by making stringent simplifying assumptions will be adhered 
to in connection with the minimax theory. A large part of Wald’s book 
[W3] is concerned with overcoming the difficulties in technique that are 
here avoided by simplifying assumptions, but that must be faced in 
many practical problems. Despite Wald’s able effort, important prob- 
lems of analytic technique still remain in connection with the minimax 
theory. It should also be appreciated that the individual mathematical 
problems raised by applications of the minimax theory are often very 
awkward, even when stringent simplifying assumptions are complied 
with; consequently much work on specific applications of the theory is 
still in progress. 


CHAPTER 9 


Introduction to 
the Minimax Theory 


1 Introduction 


This chapter explains what the minimax theory is, almost without 
reference to the theory of personal probability. This course seems best, 
because the theory was originated from an objectivistic point of view 
and as the solution of an objectivistic problem. Moreover, a philo- 
sophically more neutral presentation seems to result, if the ideas of per- 
sonal probability are here kept out of the foreground. 

The minimax theory begins with some of the ideas with which the 
theory of personal probability, as developed in this book, also begins. 
In particular, the notions of person, world, states of the world, events, 
consequences, acts, and decisions presented in §§ 2.2-5 apply as well 
to the minimax theory—from which they were in fact derived—as to 
the theory of personal probability. 

The point at which the two theories depart from each other is § 2.6, 
which postulates that the person’s preferences establish a simple order 
among all acts. That assumption is necessarily rejected by objectivists, 
for it, together with the sure-thing principle (which they presumably 
accept), implies the existence of personal probability. For objectivists, 
of course, conditional probability does not apply to all ordered pairs of 
events. More specifically, it seems to be a tacit assumption of objecti- 
vistic statistics that the world envisaged in any one problem is parti- 
tioned into events with respect to each of which the conditional proba- 
bilities of all events (ignoring the mathematical technicality of measura- 
bility considerations) are defined, but that conditional probability with 
respect to sets other than unions of elements of the partition are not 
defined. That, incidentally, is why partition problems dominate objec- 
tivistic statistics. The partition in question is in general infinite, but, 
for mathematical simplicity, it will here be assumed to be a finite partition
B_i.

The objectivistic position is not in principle opposed to the concept 
of utility. In particular, the minimax theory is predicated on the idea 
that the consequences of those acts with which it deals are measured
numerically by a quantity the expected value of which the person 
wishes to have as large as possible, whenever (from the objectivistic 
point of view) the concept of expected value applies. It will therefore 
be doing the minimax theory little or no injustice to postulate here, as 
elsewhere, that the consequences of acts are measured in utility. 

These preliminaries disposed of, the general objectivistic decision 
problem is to decide on an act f in some given F, by criteria depending 
only on the conditional expectations E(f | B_i), and therefore without
reference to the “meaningless” P(B_i).

Taking any personalistic or necessary point of view literally, it is 
nonsensical to pose an objectivistic decision problem, that is, to ask 
which f of F is best for the person, without reference to the P(B_i). On
the other hand, many, if not all, holders of objectivistic views, like Wald, 
find themselves logically compelled by two widely held tenets to con- 
sider such problems meaningful. First, for reasons I have alluded to in 
Chapter 2 and will soon expand upon, many theoretical statisticians 
today agree, at least tacitly, that the object, or at any rate one object, 
of statistics is to recommend wise action in the face of uncertainty—a 
point of view that Wald was particularly active in bringing to the fore. 
Second, statisticians of the British-American School, of which Wald is 
to be considered a member, are objectivists and are therefore committed 
to the view that the probabilities P(B_i) are meaningless, or, at any
rate, that they cannot be legitimately used in solutions of statistical 
problems. 

So far as I know, Wald is the only one who has proposed any solution 
to the general objectivistic decision problem, barring minor variations. 
His proposal, which is here called the minimax theory, is rather compli- 
cated to state. In view of its complexity and the importance of this 
theory for the rest of this book, and for statistical theory generally, I 
hope the reader will have particular patience with the present chapter. 


2 The behavioralistic outlook 


Prior to Wald’s formulation of what is here called the objectivistic 
decision problem, the problems of statistics were almost always thought 
of as problems of deciding what to say rather than what to do, though 
there had already been some interest in replacing the verbalistic by the 
behavioralistic outlook. The first emphasis of the behavioralistic out- 
look in statistics was apparently made by J. Neyman in 1938 in [N3], 
where he coined the term “inductive behavior” in opposition to “inductive
inference.” In the verbalistic outlook, which still dominates
most everyday statistical thought, the basic acts are supposed to be 
assertions; and schemes based on observation are sought that seldom
lead to false, or at any rate grossly inaccurate, assertions. 

The verbalistic outlook in statistics seems to have its origin in the 
verbalistic outlook in probability criticized in § 2.1, which in turn is 
traceable to the ancient tradition in epistemology that deductive and inductive
inference are closely analogous processes.

I, and I believe others sympathetic with Wald’s work, would analyze 
the verbalistic outlook in statistics thus: Whatever an assertion may 
be, it is an act; and deciding what to assert is an instance of deciding 
how to act. Therefore decision problems formulated in terms of acts 
are no less general than those formulated in terms of assertions. 

If, on the other hand, a sufficiently broad interpretation is put on the 
notion of assertion, perhaps every decision to adopt an act can be re- 
garded as an assertion to the effect that that act is the best available, 
in which case the difference between the verbalistic and the behavioral- 
istic outlooks is only terminological; but I do think that, even under 
such an interpretation, the behavioralistic outlook with its tendency 
to emphasize consequences offers the better terminology. 

Fallacious attempts to analyze away the difference between the ver- 
balistic and behavioralistic viewpoints are also sometimes put forward, 
especially in informal discussion. For example, it is sometimes said 
that one should act as though his best estimate of a quantity were in 
fact the quantity itself. But on that basis few of us would buy life 
insurance for next year, for we do not typically estimate the year of 
our death to be so close. Other examples are discussed by Carnap in 
Section 50 of [C1].

If assertions are, indeed, to be interpreted as a special class of acts 
of particular importance to statistics, I have no clear idea what that 
class may be; but it would presumably exclude certain acts, such as the 
design of an experiment, that surely are of importance to statistics. 
Actually the verbalistic outlook has led to much confusion in the foun- 
dations of statistics, because the notion of assertion has been used in 
several different, but always ill-defined, senses, and because emphasis 
on assertion distracts from the indispensable concept of consequences. 
I conclude that the behavioralistic outlook is clearer, fuller, and better 
unified than the verbalistic; and that such value as any verbalistic con- 
cept may have it owes to the possibility of one or more behavioralistic 
interpretations. 

This analysis is really too brief and must be supplemented by certain 
remarks. To begin with, the reader may wonder whether the verbalistic 
outlook has adherents who defend it against the behavioralistic, and if 
so what their arguments may be. Actually, the statistical public seems 
to greet the behavioralistic outlook as a relatively new idea—how old
it may actually be is beside the point here—which as such must be re- 
garded with some skepticism. To the best of my knowledge, however, 
only one objection against the behavioralistic outlook has been pre- 
sented. It must be discussed next. 

It has been seen as an objection to the behavioralistic outlook that 
the consequences of some assertions, particularly those of pure science, 
are extremely subtle and difficult to appraise. As a function of the true 
but unknown velocity of light, what, for example, will be the consequences
of asserting that the velocity of light is between 2.99 × 10^10
and 3.01 × 10^10 centimeters per second? But, if some acts do have
subtle consequences, that difficulty cannot properly be met by denying 
that they are acts or by ignoring their consequences. Certain practical 
solutions of the difficulty are known. For example, considerations of 
symmetry or continuity may, as is illustrated in Chapters 14 and 15, 
make a wise decision possible even in some cases where the explicit 
consequences of the available acts are beyond human reckoning. Again, 
analysis sketched in the next two paragraphs tends to show that asser- 
tions with extremely subtle consequences play a smaller role in science 
and other affairs than might at first be thought. 

No worker would actually publish—indeed no journal would accept 
—as research the hypothetical assertion about the velocity of light mentioned
in the paragraph above. The consequences might be subtle, if
he did; but they would not be very important, for no one would take 
him seriously. An actual worker would do as much as was practical 
to say what observations relevant to the velocity of light he, and per- 
haps others, had performed and what had been observed. To be sure, 
his statement of the observations would typically be much condensed; 
he would resort to sufficient statistics or other devices to put his reader 
rapidly in position to act as though the reader himself had made the 
observations. Assertions about the velocity of light, and countless 
others of that sort, are of course published in textbooks and handbooks. 
These assertions do indeed have complicated consequences, so judgment 
is called for in the compilation of such books; but the seriousness of the 
consequences of their assertions is limited because of the possibility of 
referring to original research publications, a possibility serious text- 
books and handbooks facilitate by the inclusion of bibliographies. 

On the other hand, it is obvious that many problems described ac- 
cording to the verbalistic outlook as calling for decisions between asser- 
tions really call only for decisions between much more down-to-earth 
acts, such as whether to issue single- or double-edged razors to an army, 
how much postage to put on a parcel, or whether to have a watch readjusted.
It is time now to turn back to objectivistic decision problems. 


3 Mixed acts 


Speaking with pedantic strictness, it might be said that Wald does 
not propose a solution for the general objectivistic decision problem, 
because, before undertaking a solution, he insists that F be subject to 
a certain condition. On the other hand, he argues that the condition 
is typically met in practice; he might fairly have insisted that it is the 
very heart of much actual statistical practice. Before discussing the 
issue in detail, let me give a small but typical illustration of it. 

Suppose that in a rental library I am confronted with the choice be- 
tween two detective stories, each of which looks more horrifying than 
the other. At first sight it would seem that only two acts are open to 
me, namely, to rent one book or the other, but Wald points out that 
there are other possibilities, not ordinarily thought of as such. In par- 
ticular, I can eliminate one of the books by flipping a coin. More accu- 
rately and more generally, I can let my choice depend on the outcome 
of a random variable that is utterly irrelevant to the fundamental par- 
tition—in this example, a random variable the outcome of which is in- 
dependent of the relative merits of the two books. The random varia- 
ble may as well be confined at the outset to two values corresponding to 
the rental of one or the other of the books, and random variables as- 
signing the same probabilities to the books are equivalent for the pur- 
pose at hand. In practice, especially serious statistical practice, such 
random variables are, taking reasonable precautions, readily provided 
by coins, cards, dice, tables of random numbers, and other devices. 

In terms of the general objectivistic decision problem, Wald’s point 
can (except for mathematical technicalities) be formulated thus: If f_r
represents a finite number of elements of F, and φ(r) is a corresponding
set of non-negative numbers such that Σφ(r) = 1, then the person can
make the mixed act


(1) f = Σ_r φ(r)f_r,


available to himself by observing at no appreciable cost a random variable
taking the values r with corresponding probabilities φ(r) irrespective
of which B_i obtains, so F may be assumed to include f. Technically,
the sum in (1) should, for full generality, be replaced by an integral
with respect to a probability measure. But such integrals become
superfluous under the simplifying assumption, which is herewith made,
that there are in F a finite set of acts f_r, to be called primary acts, with
respect to which every act in F can be represented in the form (1). In
the rental-library example, the two acts corresponding to the two books 
can be regarded as primary. 

Since mixed acts are also available from the personalistic point of 
view, it may well be asked whether it is advantageous to consider them 
in connection with that point of view, and, if not, how they can be of 
advantage from one point of view but not the other. The answer to 
the first part of the question is easy. Indeed, if f is defined by (1) then 
it is personalistically impossible that f should be definitely preferred to 
every f_r, that is, that


(2) E(f) = Σ_r φ(r)E(f_r) > max_r E(f_r),


for a weighted mean cannot be greater than all its terms. Technical 
explanation of the efficacy of mixed acts from the objectivistic point of 
view can best be presented after the whole statement of the minimax 
rule, but those at all familiar with modern statistical practice will derive
some insight from the remark that the usual preference of statisticians
for random samples represents a preference for certain mixed
acts.
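The impossibility of inequality (2) can be illustrated with a short calculation. The fragment below is a minimal sketch, not part of the original argument: the expected utilities and mixing probabilities are hypothetical numbers, and the function name `mixed_expectation` is invented for the example.

```python
from fractions import Fraction


def mixed_expectation(phi, expected_primary):
    """Expected utility of the mixed act f = sum_r phi(r) f_r."""
    if any(p < 0 for p in phi) or sum(phi) != 1:
        raise ValueError("phi must be non-negative and sum to 1")
    return sum(p * e for p, e in zip(phi, expected_primary))


# Hypothetical values of E(f_r) for three primary acts.
expected_primary = [Fraction(3), Fraction(-1), Fraction(2)]
# Hypothetical mixing probabilities phi(r).
phi = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

e_mixed = mixed_expectation(phi, expected_primary)

# A weighted mean never exceeds the largest of its terms.
print(e_mixed, max(expected_primary))
```

Whatever weights are tried, the first number printed cannot exceed the second, so a person who maximizes expected utility never strictly prefers a mixture to every one of its components.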


4 Income and loss 


It is sometimes suggestive, and in conformity with some statistical
(though not quite with economic) usage, to refer to E(f | B_i) as the
income of f when B_i obtains, and, correspondingly, to use the notation
I(f; i). An important concept associated with the income is that which
I shall refer to as the loss (symbolized by L(f; i)) incurred by the act f
when B_i obtains. By that I mean the difference between the income
the person could attain if he were able to act with the certain knowledge
that B_i obtained and that which he will attain if he decides on f when
B_i does in fact obtain. Formally,


(1) L(f; i) =_Df max_{f’} I(f’; i) − I(f; i).


If the person decides on f when B_i obtains, L(f; i) measures in terms of
income the error he has made. If he were himself informed of B_i after
f had been chosen, which is not typically the case, L(f; i) would, so to
speak, measure his cause for regret. On that account, some have proposed
to call loss “regret,” but that term seems to me charged with
emotion and liable to lead to such misinterpretation as that the loss
necessarily becomes known to the person. On the other hand, the
term “loss” has been used by Wald in the sense of negative income,
but in contexts where loss as defined here is, of the two senses, the only
defensible one, as will be explained in § 8. I hope the sense proposed
here will not cause serious confusion.


Exercises 


1. For each i, there is at least one primary act f_r such that


(2) I(f_r; i) = max_f I(f; i).


Such a primary act may fairly be called correct for i.

2. L(f; i) = Σφ(r)L(f_r; i) ≥ 0, equality holding if and only if f is a
mixture of acts correct for i.

3. L(f; i) = max_{r’} I(f_{r’}; i) − I(f; i).

4. L(f; i) = −I(f; i), if and only if


(3) max_r I(f_r; i) = 0.
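The definitions of income and loss, together with Exercises 2 and 3, translate directly into a small computation. The sketch below assumes a hypothetical income table with three primary acts and a twofold partition; the function names `loss_table` and `mixed_loss` are invented for illustration.

```python
from fractions import Fraction


def loss_table(income):
    """Turn income[r][i] = I(f_r; i) into loss[r][i] = L(f_r; i).

    By Exercise 3 the maximum income for each i may be taken over the
    primary acts alone.
    """
    n_states = len(income[0])
    best = [max(row[i] for row in income) for i in range(n_states)]
    return [[best[i] - row[i] for i in range(n_states)] for row in income]


def mixed_loss(phi, loss):
    """L(f; i) = sum_r phi(r) L(f_r; i) for f = sum_r phi(r) f_r (Exercise 2)."""
    n_states = len(loss[0])
    return [sum(p * row[i] for p, row in zip(phi, loss)) for i in range(n_states)]


# Hypothetical income table: rows are primary acts, columns are B_1 and B_2.
income = [[4, 0],
          [1, 3],
          [2, 2]]
loss = loss_table(income)

# A hypothetical mixed act giving probability 1/2 to each of the first two acts.
phi = [Fraction(1, 2), Fraction(1, 2), Fraction(0)]
print(loss, mixed_loss(phi, loss))
```

Every entry of the loss table is non-negative, and each column contains a zero in the row of an act that is correct for that i, as Exercise 1 asserts.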


5 The minimax rule, and the principle of admissibility 


The most characteristic feature of the minimax theory is a certain 
rule of behavior, or recommendation to the person. This rule, to be 
called the minimax rule, can now be formulated thus: Decide on an 
act f’, such that 


(1) max_i L(f’; i) = min_f max_i L(f; i),


where f and f’ are, of course, confined to F. 

In words, the minimax rule recommends the choice of such an act 
that the greatest loss that can possibly accrue to it shall be as small as 
possible. An f satisfying the recommendation of the minimax rule will 
be called a minimax act, and the greatest loss that can accrue to a mini- 
max act will be called the minimax value of the (objectivistic) decision 
problem and written L*. Under the simplifying assumptions that have 
been made, it is not technically difficult to show that at least one mini- 
max act exists. The statement of the rule can be reasonably extended 
to mathematically more general situations, but a digression about this 
possibility is not appropriate here. The name of the rule is presumably 
derived from the abbreviation “min max” in (1) or from the Latin
phrase “minimum maximorum” thus abbreviated.

It may well happen that F contains more than one act that is minimax
for the problem, in which case the minimax rule recommends, not
a particular act, but only that the choice be narrowed to the set of 
minimax acts. Some other criterion must then be invoked to narrow 
the choice further. In particular, it can be shown that at least one of 
the minimax acts is admissible, in the sense of § 6.4. As Wald indicates, 
it would, therefore, be an inexcusable violation of the sure-thing prin- 
ciple not to narrow the choice to admissible acts. This application of 
the sure-thing principle will be called the principle of admissibility. 
The minimax rule and the principle of admissibility constitute the sub- 
ject matter of, and thereby define, the minimax theory. 


6 Illustrations of the minimax rule 


It would be hard to imagine an objectivistic decision problem simpler 
than that of whether to make an even-money (or more accurately, even- 
utility) bet in favor of a certain event or to refrain from betting. That 
problem, therefore, provides a convenient first example of the minimax 
rule and the concepts associated with it. Supposing, as one may without
loss of generality, that the bet is for one utile, the objectivistic decision
problem is completely described by Table 1, which gives the income
of each of the two primary acts for each of the two elements of
the partition corresponding to the event in question and its complement.


TABLE 1. THE INCOME OF AN EVEN-MONEY BET, I(f_r; i)

                         Event
Act                   B_1     B_2
Bet, f_1               1      −1
Don’t bet, f_2         0       0

In view of Exercises 4.2 and 4.3 the corresponding loss function is 
described by Table 2. Therefore, 


(1) max_i L(f; i) = max_i Σ_r φ(r)L(f_r; i)

 = max_i φ(i) ≥ 1/2,

equality obtaining if and only if φ(1) = φ(2) = 1/2. Therefore, L* = 1/2,
and the only minimax act is f = (1/2)f_1 + (1/2)f_2.


TABLE 2. THE LOSS OF AN EVEN-MONEY BET, L(f_r; i)

              Event
Act        B_1     B_2
f_1         0       1
f_2         1       0


In this problem, therefore, the minimax rule recommends that the 
person decide, in effect, by flipping a fair coin. If the odds in the bet 
had not been even, the minimax rule would have recommended the 
use of a coin with a certain bias; this more general example will be 
worked out in detail in § 12.4. It is noteworthy in connection with the 
present problem—for it happens in many others—that, for the minimax 
act f, L(f; i) = L* for every value of i.
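
The computation behind (1) can be checked mechanically. The following Python sketch rebuilds the loss of Table 2 from the income of Table 1 and then searches a grid of mixing probabilities for the smallest maximum loss. The function names and the grid search are illustrative assumptions introduced here, not part of the theory itself.

```python
from fractions import Fraction

# Income I(f_r; i) from Table 1: one row per primary act, columns B_1 and B_2.
income = {"bet": [1, -1], "dont_bet": [0, 0]}

def loss_table(income):
    """L(f_r; i) = (largest income available under event i) - I(f_r; i)."""
    n_events = len(next(iter(income.values())))
    best = [max(row[i] for row in income.values()) for i in range(n_events)]
    return {act: [best[i] - row[i] for i in range(n_events)]
            for act, row in income.items()}

def max_loss(loss, phi):
    """Largest loss over the events for the mixed act with weights phi[act]."""
    n_events = len(next(iter(loss.values())))
    return max(sum(phi[a] * loss[a][i] for a in loss) for i in range(n_events))

loss = loss_table(income)

# Grid search over the probability p of betting (an illustrative device only).
grid = [Fraction(k, 100) for k in range(101)]
best_p = min(grid, key=lambda p: max_loss(loss, {"bet": p, "dont_bet": 1 - p}))
l_star = max_loss(loss, {"bet": best_p, "dont_bet": 1 - best_p})
```

Here `loss` reproduces Table 2, and the search settles on the fair coin, with `best_p` and `l_star` both equal to 1/2. Exact fractions are used so that the equality L(f; i) = L* can be tested without rounding.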

The following more elaborate example, illustrating the mechanism of 
observation, is paraphrased from a slightly incorrect example in [82]. 
Of three numbered coins, two are pennies and one is a dime, or else one 
is a penny and two are dimes. This gives rise to a sixfold partition B_i,
because any of the three coins may be the singular one, and in two ways. 
The available primary acts are described in two stages thus: First, the 
person may select one of the coins by number for observation, or he 
may refrain from so doing; second, he must guess at the denomination 
of the singular coin. His income in utiles is defined by the following 
conditions: 


1. If the singular coin is a penny, he must pay a tax of 10; if it is a
dime, he receives a bonus of 20. 

2. If he chooses to observe a coin, he must pay an inspection fee of 
1, regardless of the particular coin selected for observation. 

3. If his guess is incorrect he pays a penalty of 8. 


It is easy to see that the first of the three terms in the person’s in- 
come is irrelevant to his loss, since his decision does not affect the mag- 
nitude of that term. His loss is therefore the sum of two terms. The 
first of these is 1 or 0 depending on whether he decides to make an ob- 
servation; the second is 0 or 8, depending on whether his guess is correct. 

If the person chooses not to pay the inspection fee, it is clear from the
preceding example that, no matter what he does, his loss may be as
high as 4, and that it is certain to be that small if and only if he governs
his guess (essentially) by the flip of a fair coin.

Suppose next that the person decides to make an observation. If 
he selects any particular coin for observation, he is as badly off as he 
was before the observation, and he has in addition incurred the inspec- 
tion fee. Thus, even if the person knows that the first coin is a penny, 
there is nothing he can do to be sure that his total loss will not be more 
than 5, and, as before, he can guarantee that small a loss only by govern- 
ing his guess with the flip of a fair coin. 

I think every practicing statistician would say that, if an observation 
is to be made at all, one of the three coins should be selected at random 
(i.e., the probability 1/3 should be attached to observing each of them) 
and after the observation the person should guess that the singular 
coin is opposite in denomination to the one observed. It will be shown 
in the next paragraph that this common-sense act is minimax. 

In the first place, the loss L(f_0; i) for the act f_0 in question is, for each
i, equal to 1 + (1/3) × 8 = 3⅔, which is less than 4; for the inspection fee
is 1 and the probability of making a wrong guess, which would result
in the loss of 8, is 1/3. To show that f_0 is minimax, it will be enough to
show that every act can result in a loss of at least 3⅔. One possibility
for doing this (which in § 12.3 will be shown to be a natural one to try)
is to show that, for a certain set of weights, the weighted average of
L(f; i) with respect to i is at least 3⅔ for all f. In fact, it is sufficient,
in view of Exercise 4.2, to establish such an inequality for the primary
acts. In the present example, it happens that the weights can be chosen
to be equal. What is to be shown, then, is that the following inequality
obtains for every primary f.


(1) $L(f) =_{\mathrm{Df}} \frac{1}{6} \sum_i L(f; i) \geq 3\tfrac{2}{3}.$


Now, if the primary act f does not involve observation, L(f) = 4; be- 
cause three of the six terms to be averaged are then 8, and the other 
three are 0. Suppose next, for definiteness, that f involves the obser- 
vation of the first coin; there are then three possibilities to consider. 
First, the guess is made without regard for the denomination observed, 
in which case the observation is, so to speak, thrown away, making 
L(f) = 5. Second, the denomination guessed may be the same as the 
denomination observed, in which case the guess will be wrong for four 
of the six values of i, making L(f) = 6⅓. Finally, the denomination
guessed may be the opposite of the one observed, in which case the guess
will be wrong for two of the six values of i, making L(f) = 3⅔. This
argument shows that L* ≥ 3⅔; and, since L(f_0; i) = 3⅔ for every i, f_0
is a minimax act and L* = 3⅔. It would not be difficult to show that
f_0 is the only minimax act for this problem.
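
The case analysis of the three-coin example is small enough to enumerate. The Python sketch below lists the six elements of the partition and the fourteen primary acts, computes the equally weighted average loss of each primary act, and evaluates the common-sense randomized act state by state. The encoding of acts as pairs and all identifiers are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product

DENOMS = ("penny", "dime")
other = {"penny": "dime", "dime": "penny"}

# The sixfold partition: which coin is singular, and its denomination.
states = [(k, d) for k in range(3) for d in DENOMS]

def coin(j, state):
    """Denomination of coin j when `state` obtains."""
    k, d = state
    return d if j == k else other[d]

def loss(act, state):
    """Inspection fee (1 or 0) plus a penalty of 8 for a wrong guess."""
    j, rule = act                      # j is None when no coin is observed
    if j is None:
        fee, guess = 0, rule           # without observation, rule is the guess
    else:
        fee, guess = 1, rule[coin(j, state)]
    return fee + (8 if guess != state[1] else 0)

# Primary acts: 2 without observation, 3 coins x 4 guessing rules with one.
rules = [dict(zip(DENOMS, g)) for g in product(DENOMS, repeat=2)]
primary = [(None, d) for d in DENOMS] + [(j, r) for j in range(3) for r in rules]

def average_loss(act):
    """Equally weighted average of the loss over the six states."""
    return Fraction(sum(loss(act, s) for s in states), len(states))

# Observe a coin chosen at random, then guess the opposite denomination.
f0 = [(j, dict(other)) for j in range(3)]
f0_loss = {s: Fraction(sum(loss(a, s) for a in f0), 3) for s in states}

lower_bound = min(average_loss(a) for a in primary)
```

Every entry of `f0_loss` comes out as 11/3, and `lower_bound` is also 11/3, which are the two facts the argument combines. The averages for the individual primary acts are 4, 5, 19/3, and 11/3, in agreement with the cases enumerated in the text.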


7 Objectivistic motivation of the minimax rule 


The minimax rule recommends an act for the person to choose; more 
strictly, it recommends a sharp narrowing of his choice. But how can 
this particular recommendation be motivated? To the best of my 
knowledge no objectivistic motivation of the minimax rule has ever 
been published. In particular, Wald in his works always frankly put 
the rule forward without any motivation, saying simply that it might 
appeal to some. Though my heart is no longer in the objectivistic point
of view, I will in the next few paragraphs suggest a relatively objecti- 
vistic motivation of the rule. 

I evolved this far from satisfactory argument at a time when I took 
the objectivistic view for granted. Now, as a personalist, it still seems 
interesting to me in that it shows, or at least suggests, how statistical 
devices combat vagueness, a topic I find very difficult to discuss di- 
rectly. On a different level, the argument may shed light on the per- 
sonalistic view by suggesting how personalistic ideas entered the mind 
of at least one objectivist. 

A categorical defense of the minimax rule seems definitely out of the 
question. Suppose, for example, that the person is offered an even- 
money bet for five dollars—or, to be ultra-rigorous, for five utiles— 
that internal combustion engines in American automobiles will be obso- 
lete by 1970. If there is any event to which an objectivist would refuse 
to attach probability, that corresponding to the obsolescence in ques- 
tion is one. As the example centering around Tables 6.1-2 makes clear, 
the minimax rule recommends that the bet be taken or rejected accord- 
ing as a fair coin falls heads or tails. Yet, I think I may say without 
presumption that you would regard the bet against obsolescence as a 
very sound investment, agreeing that provision for adequate interest 
and compensation for changes in the value of money is implicit in meas- 
urement of income in utiles. 

On the other hand, there are practical circumstances in which one 
might well be willing to accept the rule—even one who, like myself, 
holds a personalistic view of probability. It is hard to state the cir- 
cumstances precisely, indeed they seem vague almost of necessity. 
But, roughly, the rule tends to seem acceptable when L* is quite small 
compared with the values of L(f; i) for some acts f that merit serious
consideration and some values of i that do not in common sense seem
nearly incredible. Suppose, for example, that I were faced with such
a decision problem, in which it may be assumed for simplicity that there 
is only one minimax act f, and consider how I might defend the choice 
of that act to someone who proposed another to me. He might, for 
example, tell me that he knows from long experience, or by a tip from 
his broker, that some act g is preferable to f. “Well,” I might say, “I
have all the respect in the world for you and your sources of informa- 
tion, but you can see for yourself—for it is objectively so—that the 
most I can lose if I adopt f is L*.” He will not be able to say the same
for g, and in many actual situations the greatest possible loss under g 
may be many times as great as L* and of such a magnitude as to make 
a serious difference to me should it occur, which may well end the argu- 
ment so far as I am concerned. 

It is of interest, however, to imagine that my challenger presses me 
more closely, reminding me that I am a believer in personal probability, 
and that in fact I myself attach an expected loss L to g that is several 
times smaller than L*. Even then, depending on the circumstances, I 
might answer frankly that in practice the theory of personal probability 
is supposed to be an idealization of one’s own standards of behavior; 
that the idealization is often imperfect in such a way that an aura of 
vagueness is attached to many judgments of personal probability; that 
indeed in the present situation I do not feel I know my own mind well 
enough to act definitely on the idea that the expected loss for g really 
is L; but that I do, of course, feel perfectly confident that f cannot re- 
sult in a loss greater than L*, a prospect that in the case at hand does 
not distress me much. 

It seems to me that any motivation of the minimax principle, ob- 
jectivistic or personalistic, depends on the idea that decision problems 
with relatively small values of L* often occur in practice. The mechanism
responsible for this is the possibility of observation. The cost of
a particular observation typically does not depend at all on the uses to 
which it is to be put, so when large issues are at stake an act incorporat- 
ing a relatively cheap observation may sometimes have a relatively 
small maximum loss. In particular, the income, so to speak, from an 
important scientific observation may accrue copiously to all mankind 
generation after generation. 


8 Loss as opposed to negative income in the minimax rule 


As a variant to the minimax rule as I have stated (or perhaps I should 
say interpreted) it, one might consider the possibility of letting the 
negative of income play the role of the loss in (5.1). Indeed, strictly 
speaking, Wald himself always proposed the minimax rule in that 
form. I believe he never made written allusion to the rule formulated
in terms of loss (as “loss” is defined here); orally he took the position 
that loss and the form of the minimax rule based on it were inventions 
of mine, toward which he was tentatively sympathetic. There is vir- 
tually no mathematical difference between the two rules, and it was 
characteristic of Wald’s approach to the foundations of statistics to be 
reluctant to commit himself with respect to any other differences. 

Though the minimax rule founded on the negative of income seems 
altogether untenable, as will soon be explained, and though no one but 
myself seems to question that I originated the variant of the theory 
based on loss, little or no originality is attributable to me in this re- 
spect. Wald more than foreshadowed the idea, for, though he based 
his minimax rule on the negative of income, he made it clear in publica- 
tions, including [W3], that he regarded as typical problems in which 
the income has, for every i, the property specified in Exercise 4.4.
Therefore, in the situations Wald regarded as typical, the distinction 
between the two forms of the rule vanishes, so, until hearing his ex- 
plicit disavowal, I considered the idea of loss as opposed to negative 
income his. 

To see that the minimax rule founded on the negative of income is 
utterly untenable for statistics, consider, for example, a twofold parti- 
tion problem with two primary acts in which the income is as in Table 1. 


TABLE 1. I(f_r; i)

                Event
Act         B_1      B_2
f_1          -1       -1
f_2         -10        1

Now, if the person were interested in minimizing the maximum of the
negative income, he would have no recourse but to decide on f_1, in which
case (but in no other) he could be sure that the negative income would
be at most 1, whichever B_i obtained. This may not in itself seem
objectionable, but suppose now that the person has available free of cost
an observation, however relevant to B_i. Then, no matter what derived
act he chooses, if B_1 obtains, his negative income will be at least 1
utile; and, to be sure that it is not more, he again has no recourse but
to decide on f_1. In short, for the problem at hand, the person’s behavior
would not be influenced by any observation, however relevant. This
seems to me absurd on the face of it, but perhaps the absurdity can be
brought out by a less abstract situation paralleling the example just
given. A person has a ladder, and, just as he is about to use it, it
occurs to him that the ladder may possibly be dangerously defective.
He envisages two basic primary acts: f_1, to throw the ladder away and
buy a new one, which will cost 1 utile in either event; and f_2, to use the
ladder, which will, if the ladder is defective, result in his injury to the
extent of 10 utiles, and will, if the ladder is sound, accomplish his
object, which is worth 1 utile. Now, if the person acts on the principle of
minimizing the maximum of negative income, he will throw the ladder
away, no matter what tests tend to show that it is sound.
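
The contrast between the two forms of the rule in the ladder example can be made concrete. The Python sketch below supposes a free test that misreports the condition of the ladder with probability 1/20. That error rate, the restriction to the four pure derived acts (no mixtures), and all identifiers are assumptions made only for illustration.

```python
from fractions import Fraction
from itertools import product

# Income I(f_r; i) of the ladder example: event 0 is "defective" (B_1),
# event 1 is "sound" (B_2).
income = {"discard": (-1, -1), "use": (-10, 1)}

ERROR = Fraction(1, 20)  # hypothetical chance that the free test misreports

def p_report(report, event):
    """Probability that the test gives `report` (0 or 1) when `event` obtains."""
    return 1 - ERROR if report == event else ERROR

def expected_income(rule, event):
    """rule[report] names the primary act taken after each test report."""
    return sum(p_report(rep, event) * income[rule[rep]][event] for rep in (0, 1))

# The four pure derived acts: one primary act for each possible report.
derived = list(product(income, repeat=2))

best_income = [max(row[e] for row in income.values()) for e in (0, 1)]

def max_negative_income(rule):
    return max(-expected_income(rule, e) for e in (0, 1))

def max_loss(rule):
    return max(best_income[e] - expected_income(rule, e) for e in (0, 1))

by_negative_income = min(derived, key=max_negative_income)
by_loss = min(derived, key=max_loss)
```

Under these assumptions `by_negative_income` is the act that discards the ladder whatever the test reports, while `by_loss` discards it only when the test reports a defect, with maximum loss 9/20. Allowing mixed derived acts could lower the maximum loss further, but it would not change the negative-income recommendation.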


CHAPTER 10 


A Personalistic Reinterpretation 


of the Minimax Theory 


1 Introduction 


In this chapter a reinterpretation of the minimax theory, based on 
the theory of personal probability and the idea that statistical problems 
are typically multipersonal, is tentatively put forward. The reinter- 
pretation is based on a model or scheme that captures, I believe, much 
of the essence of actual statistical situations, but it may be possible to 
effect that end with other equally simple and even more realistic models; 
for the one to be presented here leaves much to be desired. In struc- 
ture, this chapter is kept roughly parallel with Chapter 9, to enable the 
reader to examine as closely as he may wish the parallelism between the 
objectivistic interpretation given there and the personalistic one given 
here. In particular, the liberty is taken of giving old symbols new mean- 
ings in order to bring out the parallelism between the two interpreta- 
tions. 


2 A model of group decision 


Consider a group of people, indexed by numbers i. These people are
supposed to have the same utility function, at least for the consequences 
to be considered in the present context, but their personal probabilities 
are not necessarily the same. The group of people is placed in a situa- 
tion in which it must, acting in concert, choose an act f from a finite 
set of available acts F, the consequences of the acts being measured in 
terms of the common utility of the members of the group. 

The situation just described will be called a group decision problem. 
It is epitomized by a jury. The members of the jury, in legal theory, 
are supposed to have common value judgments in connection with the 
legal matters at hand; for these are incorporated in the law as stated 
in the instructions of the court. But it is part of the very concept of a 
jury that its members may be of different opinions; that their judgments 
as to questions of fact may differ; that, to put it technically, they may
have different systems of personal probability. Still other situations 
resembling the group decision problem are widespread in science and 
industry, though the group decision problem does by no means repre- 
sent the only sort of social interaction tending to make the theory of 
personal probability, confined to a single person, inadequate. When- 
ever a hospital or a factory modifies its procedures, whenever a doctrine 
is adopted with little reservation by virtually all the workers in a 
science, or whenever a panel of experts drafts a report, something like 
group decision is taking place. 

Since the members of the group in a group decision problem, though 
required to act in concert, typically differ from one another in their 
probability judgments, it is too much to expect that any rule can be 
formulated that will be acceptable to, or in any sound sense proper for, 
all groups under all circumstances. On the other hand, there may be 
one or more rules of thumb that will lead the group to an acceptable 
compromise in many practical circumstances. Two such suggestions, 
the group minimax rule and the group principle of admissibility, will 
be made and explored in the next section. 


3 The group minimax rule, and the group principle of admissibility 


In the first place, the possibility of using mixed acts is to be pointed 
out. If, for example, you and I, walking together, disagree about which 
branch of a fork in the road leads home, we can, and in fact may, de- 
cide which to try by flipping a coin. 

In general, mixed acts are available in a group decision problem for 
reasons analogous to their availability in objectivistic decision problems,
for, though the members of a group may generally differ in the
probabilities they personally assign to some events, there is in practice 
an abundance of events associated with coins, cards, random numbers, 
and the like that make it possible for the group to mix the primary acts 
in any proportion, all members of the group being in agreement about 
what the proportions are. The example of the fork in the road illus- 
trates how the use of mixed acts can effect such a compromise as to 
make decision possible in what might otherwise be an impasse. As in 
the account of the objectivistic decision problems, it will therefore be 
taken for granted from now on that F contains all mixtures of its ele- 
ments, and once more, for mathematical simplicity, it will be assumed 
that there are a finite number of primary acts f_r in F, of which all
others are mixtures. 

The ith person in the group attaches a certain expected utility, or 
(personal) income, to the act f; call it I(f; i). In the judgment of the
ith person, adoption of the act f would represent a (personal) loss,


(1) $L(f; i) = \max_{f'} I(f'; i) - I(f; i)$


(possibly zero) as compared with the income or expected utility that 
in his opinion would result from an act he considers most promising. 

The group minimax rule is the suggestion that an act be adopted 
such that the largest loss faced by any member of the group will be as 
small as possible. To put it formally, the suggestion is that an f’ be 
adopted such that 


(2) $\max_i L(f'; i) = L^* =_{\mathrm{Df}} \min_f \max_i L(f; i).$


The parallelism between the group minimax rule and the minimax rule 
stated in § 9.5 is great. In particular, (2) is identical in appearance 
with (9.5.1). This is really only a pun, though a fruitful one, because 
L, i, and even f have altogether different meanings in the two contexts.

As indicated at the outset, it cannot be expected that the group mini- 
max rule will, or reasonably should, be accepted by every group faced 
with every problem. But, much as in the corresponding objectivistic 
decision problems, it may happen that, if L* is small, in a rather vague 
sense, the group will accept the group minimax rule. Indeed, if L* is 
small, the group minimax rule requires no member of the group to face 
a large loss, so no member will feel that the suggestion is a serious mis- 
take. In any event, no member of the group can suggest an alternative 
that will not make some member’s loss as great as L*, for there is none. 
Moreover, in many problems the group minimax rule will lead to the 
same loss L* for every member of the group (as is explained in § 12.3), 
a circumstance which, when it occurs, may add to the acceptability of 
the suggestion by making it seem fair. 
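
The fork in the road gives a two-person instance of (1) and (2) that is easy to compute. In the Python sketch below, the utilities (1 for reaching home, 0 otherwise) and the personal probabilities of the two walkers are hypothetical numbers chosen for illustration, and the grid search over mixtures is a convenience rather than a method proposed in the text.

```python
from fractions import Fraction

# Hypothetical utilities: 1 for reaching home, 0 otherwise.
utility = {"left": {"home_left": 1, "home_right": 0},
           "right": {"home_left": 0, "home_right": 1}}

# Assumed personal probabilities of the two walkers, who disagree strongly.
opinions = [{"home_left": Fraction(9, 10), "home_right": Fraction(1, 10)},
            {"home_left": Fraction(1, 10), "home_right": Fraction(9, 10)}]

def income(act, person):
    """Expected utility I(f_r; i) of a primary act in the opinion of person i."""
    return sum(p * utility[act][state] for state, p in opinions[person].items())

def loss(mix, person):
    """L(f; i): best income person i sees available, minus the income of the mixture."""
    best = max(income(a, person) for a in utility)
    return best - sum(w * income(a, person) for a, w in mix.items())

def group_max_loss(p_left):
    """Largest personal loss in the group when 'left' is taken with probability p_left."""
    mix = {"left": p_left, "right": 1 - p_left}
    return max(loss(mix, i) for i in range(len(opinions)))

grid = [Fraction(k, 100) for k in range(101)]
p_star = min(grid, key=group_max_loss)
l_star = group_max_loss(p_star)
```

With these symmetric opinions the search returns `p_star` equal to 1/2, the coin flip of the example, and both walkers face the same loss `l_star` of 2/5. Less symmetric opinions would generally call for a biased coin.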

Of course it is possible that, as in the objectivistic interpretation, 
more than one act fulfilling the minimax principle exists. Here, a para- 
phrase of the principle of admissibility will further narrow the choice, 
for if 


(3) $L(g; i) \leq L(f; i)$

for every i, with strict inequality obtaining for some i, the group cannot
seriously consider f.


4 Critique of the group minimax rule 


Some of the criticisms that have been, or may be, raised against the
minimax rule can as well be discussed in connection with one interpretation
as with the other, and Chapter 13 will be devoted to such criticisms.
But some that bear specifically on the multipersonal interpretation
in this chapter should be discussed here.

In the first place, the group minimax rule is flagrantly undemocratic. 
In particular, the influence of an opinion, under the group minimax rule, 
is altogether independent of how many people in the group hold that 
opinion. In general, it is difficult to give a formal analysis of the concept 
of democratic decision, a point discussed at length by Arrow [A5], Hil- 
dreth [H4a], and others. Perhaps, considering that the people in the 
group are postulated to have a common utility function, a satisfactory 
analysis of democratic decisions could be given in the case of a group 
decision problem by some such procedure as minimizing the average 
with respect to i of L(f; i). But, in many situations in which I envisage
application of the group minimax principle, the group will in fact be a 
rather nebulous body of people, for example the group of all specialists 
in some field. The principle would in such a case be administered by a 
single member of the group somewhat in the following fashion. In 
planning an investigation, the results of which he intends to publish, 
he will endeavor to take account of all opinions, so far as he can know 
or guess them, that are considered at all reasonable in his field of 
investigation. And when he publishes his results he will say, in 
effect, “Whatever reasonable opinions have heretofore been held by
members of this specialty, in the light of my investigation and the min- 
imax rule, it is now proper for the members of the specialty, in so far 
as they are called upon to act in concert, to agree to such and such an 
action.” To put it a little differently, in such an application the group 
is rather fictitious, and the individual investigator is admitting as rea- 
sonable a rather large class of opinions, but excluding many that he 
is sure his confreres will agree are utterly absurd. He will, for example, 
feel quite free to exclude those opinions that almost all educated people 
regard as superstitious. 

The group minimax rule is also objectionable in some contexts, be- 
cause, if one were to try to apply it in a real situation, the members of 
the group might well lie about their true probability judgments, in 
order to influence the decision generated by the minimax rule in the 
direction each considers correct. This objection is, however, scarcely 
serious in the fictitious sort of application suggested above. 

It is appropriate, in terminating this section, to discuss a certain dis- 
tinction, neglect of which can, as was pointed out to me orally by Bruno 
de Finetti, lead to serious misunderstanding of the group minimax rule. 
Voluminous observation typically tends to make any one person almost 
certain of the truth, and also, when a group of people is involved, it 
typically tends to make L* small. These two tendencies, though related,
are separate phenomena, as an illustration will bring out.

Suppose that Peter and Paul are required to bet 1 utile in concert 
either that the majority of a large electorate has voted for, or that it 
has voted against, a certain issue; but that before betting they are to 
be allowed to examine a random sample of 1,001 ballots. 

If specific opinions about the division of the electorate are assigned 
to Peter and Paul, the situation can be regarded as a group decision 
problem. To start with an interesting extreme possibility, suppose 
that it is Peter’s unequivocal opinion that 55% of the electorate is for
and 45% is against the issue and Paul’s that the division is 45% for
and 55% against; that is, Peter, for example, is supposed to act as
though he knows that the division is 55%-45%.

If, finally, it is understood that the group decision problem consists 
in the two people, Peter and Paul, deciding, before the sample is ac- 
tually observed, how their bet is to be determined by the composition 
of the sample; then the unique minimax act is to bet that the electorate 
majority is whatever the sample majority happens to be. Granting 
this easily established solution of the minimax problem, it is obvious 
that the two people both face the minimax loss L*. Peter, to be specific, 
regards L* as the probability that through random fluctuation the sam- 
ple will accidentally fail to corroborate his “knowledge” that the ma- 
jority is for the issue. Numerically, L* is about 0.0008. 

Peter and Paul, recognizing that the possibility of observing the 
sample reduces the minimax loss to about 0.0008 as compared with the 
0.5 that it would be if no sample were available, may well find the min- 
imax act a satisfactory compromise; at any rate, it is hard to see in 
this situation how they could arrive at any other. 
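
The figure of about 0.0008 can be checked directly. The Python sketch below treats the 1,001 ballots as independent draws, which is an assumption appropriate to a large electorate, and sums the exact binomial probabilities that at most 500 sampled ballots favor the issue when 55% of the electorate does. The function name is illustrative.

```python
from math import comb

N = 1001  # sample size from the example

def prob_sample_majority_against(for_pct):
    """P(at most 500 of the N sampled ballots are 'for') when for_pct% of the
    electorate is 'for', under independent (binomial) sampling."""
    against_pct = 100 - for_pct
    numerator = sum(comb(N, k) * for_pct**k * against_pct**(N - k)
                    for k in range(N // 2 + 1))
    return numerator / 100**N   # exact integer ratio, rounded once to a float

# Peter's view of the minimax loss: the sample contradicts his 55%-45% opinion.
l_star = prob_sample_majority_against(55)
```

The value of `l_star` comes out a little under 0.001, in line with the rounded figure quoted in the text. By the symmetry of the example, Paul computes the same loss from the 45%-55% division.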

Though the incorporation of the sample into the problem has greatly 
reduced L*, observation of the sample does not affect the opinion of 
either person in the slightest, for unequivocal opinions such as they 
hold are not subject to modification in the light of evidence. At least 
one of the two people is immovably wrong, and the observation of no 
sample, however large, can bring them both close to the truth. This 
brings out a contrast between the reduction of L* and the approach to 
certainty of the truth, both of which typically occur with the accumu- 
lation of evidence. 

The same contrast is expressed by remarking that, though the two
people may readily adopt the minimax act, each feeling that at the ex- 
pense of a small risk he is diverting the obstinacy of his colleague to 
their common good; after the observation of the sample, one or the 
other of them is bound to feel that the prize has been lost by a sad 
and improbable accident. 

The wary will ask, “Who will feel how, when the actual majority is
disclosed and settlement made? What if Peter’s unequivocal opinion
turns out to be false?” Such questions suggest that paradox lurks in
an example in which different people unequivocally hold mutually
inconsistent opinions, so there is some interest in considering a
modification of the example, free of that objectionable feature.

Suppose then that Peter and Paul, though strongly opinionated about 
the division of the electorate, are not absolutely unequivocal in their 
opinions. To be quite definite, suppose that Peter attaches probability 
1 − 10⁻¹⁰ to the division 55%-45% and probability 10⁻¹⁰ to the division
45%-55%, and that Paul attaches the same probabilities but in
the opposite order to the two divisions. Here, as in the example of the 
unequivocal opinions, the unique minimax act is to let the bet be chosen 
in accordance with the sample majority; L* is a trifle lower than before. 
Observation of the sample does now generally affect the opinions of the 
two people, but, though it radically reduces the minimax loss, it does 
not typically bring the two people into close agreement. If, for ex- 
ample, the division is in fact 45%-55%, Paul’s strong a priori belief 
that that is the actual division is almost sure to be strengthened by the 
sample, and Peter’s equally strong but false belief is almost sure to be 
weakened. Still, the probability is only about 1/2 that Peter will be 
led by the sample to attach an a posteriori probability even as great 
as 0.05 to the actual division. Thus, speaking loosely but practically, the 
approach to certainty of the truth is here not typically nearly so far 
advanced by observation as is the reduction of the minimax loss.* 
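
The statement that the probability is only about 1/2 can be verified by Bayes' rule. The Python sketch below assumes binomial sampling of the 1,001 ballots, takes the actual division to be 45%-55%, and gives Peter the prior probability 10⁻¹⁰ for that division. It then adds up the probabilities of those samples that raise his posterior probability for the actual division to at least 0.05. The identifiers are illustrative.

```python
from math import comb, exp, log

N = 1001                      # sample size from the example
PRIOR_TRUE = 1e-10            # Peter's prior probability for the division 45%-55%

def peter_posterior(k):
    """Peter's posterior probability of 45%-55% after k of N ballots are 'for'."""
    # Log likelihood ratio of 45%-55% against 55%-45%; binomial coefficients cancel.
    log_lr = (N - 2 * k) * log(55 / 45)
    log_odds = log(PRIOR_TRUE / (1 - PRIOR_TRUE)) + log_lr
    return 1 / (1 + exp(-log_odds))

def pmf_actual(k):
    """Probability of k 'for' ballots when the division really is 45%-55%."""
    return comb(N, k) * 45**k * 55**(N - k) / 100**N

# Probability, under the actual division, that Peter ends with posterior >= 0.05.
prob_moved = sum(pmf_actual(k) for k in range(N + 1) if peter_posterior(k) >= 0.05)
```

Under these assumptions Peter reaches a posterior of 0.05 only when the sample shows 450 or fewer ballots for the issue, which is just below the expected count of 450.45, so `prob_moved` is close to one half.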

It may not be superfluous to point out that the preceding paragraph 
alludes not only to the two different personal probability systems of 
Peter and of Paul, but also to certain conditional probabilities that 
you and I have accepted hypothetically in setting up the example. 

Whichever division does actually obtain, it is rather probable that, 
once the sample is observed, either Peter or Paul will wish he could 
break his contract. This seems to me to reflect a serious objection to 
the group minimax principle, especially in those applications in which 
the members of the group are not literally consulted, for people cannot 
be expected to abide by disappointing contracts they might have made 
but didn’t. 

For other approaches to the group decision problem see de Finetti 
[D6], [D7a], de Finetti (1954), Staël von Holstein (1970, p. 65 and ff.),
and Winkler (1968). 


+ As de Finetti has remarked, the separation between the two phenomena is 
more clearly brought out if Peter and Paul decide which bet to make on the 
basis of a tennis match between themselves. For, if each thinks himself much 
the superior player, L* will be depressed, though the opinions of Peter and 
Paul about the election remain completely unaffected by the outcome of the
match.


CHAPTER 11 


The Parallelism between 
the Minimax Theory and 
the Theory of Two-Person Games 


1 Introduction 


John von Neumann, in 1928 [V3], developed a theory of games in 
which two people play each other for money.† This theory is mathematically
so closely akin to that of the minimax rule and has had such
influence on its development that it would be artificial to give an expo- 
sition of the minimax rule without saying something of the theory of 
what von Neumann calls zero-sum two-person games, though the ac- 
count given here must necessarily be highly compressed. The most 
convenient references in English to the theory of zero-sum two-person 
games, should the reader be interested in a fuller account, are [B18], 
[M3], and Chapters II and III of [V4]; though those who read German
may find it best to start with the expository sections of the paper [V3] 
in which von Neumann first discussed the subject. 

The sort of systematic punning by which the formal parallelism be- 
tween the objectivistic and personalistic minimax theories was empha- 
sized in Chapter 10 will be used once more, to bring out the formal 
parallelism between those theories and that of zero-sum two-person 
games. Logic will be still further sacrificed to clarity and convenience 
by calling the two people who play the game “you” and “I.”


2 Standard games 


A certain sort of game, here called a standard game, is defined thus: 
You secretly choose a number r from a finite set of possibilities, and I 
secretly choose a number i, also from a finite set of possibilities. The
numbers r and i having been chosen, you pay me the sum of money
(possibly negative) L(r; 7), where L is an arbitrary function of r and i, 
known to both of us. It is assumed that, for the sums involved, each 
of us finds money proportional to utility. 

† In this completely independent development he was to some extent anticipated
by Emil Borel. Consult [F9], [F10], and [B21] for details and further references.

At first sight, standard games look very dull, though it is immediately
recognized that some such games are played. A tiny but typical example
is the game of “Button, button, who’s got the button?”; “Stone,
paper, scissors” is almost as familiar an example; and others could be 
mentioned. But, and this seems remarkable at first, any game, except 
possibly those dependent on physical skill, can be viewed as a standard 
game. The great generality of standard games is demonstrated in de- 
tail in Chapter II of [V4], but informal discussion of a single example 
will render the idea intuitively clear. Suppose then that you and I are 
to play a game of poker (of a specified variety). At first sight poker 
does not seem to be a standard game, because it involves several ran- 
dom events, and several decisions on the part of each of us, some to be 
made in the light of others. But, it can be argued, there are only a 
finite number of different situations that can arise in the course of a 
game of poker. You could, therefore, in principle write into a notebook 
exactly which choice you would make in each of the possible situations 
with which you might be faced in playing poker with me. The number 
of possible ways of compiling such notebooks, or policies of play, is 
finite; so, except for limitations of time and patience, you will be at 
no disadvantage in playing one game with me, if you simply choose 
once and for all that one of the many possible policies of play that seems 
best to you. Similarly, from my point of view, the game consists, in 
principle, in choosing one policy of play. Once you have chosen one 
of the policies possible for you, say the rth, and I have chosen one of 
the policies possible for me, say the ith, the amount you will have to 
pay me at the termination of the game is a random variable. Since it 
is agreed that the payments are effectively in utiles for both of us, your 
payment to me is effectively the expected value of this random variable, 
which may be called L(r; i) and which is in principle known to both 
of us as a function of r and i. The elaborate game of two-person poker 
is thus exhibited, at some expense to realism, as a standard game. 

Regarding the choice of an r by you or an i by me as a primary act, 
both of us are at liberty to use mixed acts. Indeed, explicit attention 
apparently was first called to the possibility of using mixed acts by 
Borel (see [B21]), in just this context. 

Let f and g represent mixed acts assigning probabilities $\phi(r)$ and $\gamma(i)$ 
to the values r and i, respectively. The standard game is now replaced 
by a somewhat different game in which you choose an f; I choose a g; 
and you pay me the amount L(f; g), where 


(1)    $L(f; g) =_{Df} \sum_{r, i} L(r; i)\,\phi(r)\,\gamma(i).$
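
The double sum in (1) can be illustrated numerically. The following is only a sketch: the payment table for "Stone, paper, scissors" (one unit to the winner), the function name `mixed_payment`, and the particular mixed acts are assumptions made for the illustration, not part of the text.

```python
from fractions import Fraction

# Assumed payment table for "Stone, paper, scissors": L[r][i] is what you
# (choosing r) pay me (choosing i); +1 means that I win one unit.
L = [
    [0, 1, -1],   # you play stone;    I play stone, paper, scissors
    [-1, 0, 1],   # you play paper
    [1, -1, 0],   # you play scissors
]

def mixed_payment(L, phi, gamma):
    """L(f; g): the sum over r and i of L(r; i) * phi(r) * gamma(i)."""
    return sum(L[r][i] * phi[r] * gamma[i]
               for r in range(len(phi)) for i in range(len(gamma)))

third = Fraction(1, 3)
uniform = [third, third, third]
always_paper = [Fraction(0), Fraction(1), Fraction(0)]

print(mixed_payment(L, uniform, always_paper))     # 0
print(mixed_payment(L, always_paper, [0, 0, 1]))   # 1: paper loses to scissors
```

A primary act is recovered as the mixed act that puts probability 1 on a single value, as `always_paper` does here.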


3 Minimax play 


Von Neumann adduces an argument, the statement of which will be 
briefly postponed, that, if you have respect for my intelligence, you will 
see to it that the most I can possibly take from you shall be as small 
as possible, that is, you will choose an f’ for which 


(1)    $\max_g L(f'; g) = L^* =_{Df} \min_f \max_g L(f; g).$


Symmetrically, according to his argument, I should choose a g’ such 
that 


(2)    $\min_f L(f; g') = L_* =_{Df} \max_g \min_f L(f; g).$


Since, making the recommended choice, you are sure that you will 
not pay me more than L*, and I am correspondingly sure that you will 
not pay me less than L_*, it follows that $L_* \le L^*$. This inequality 
would, of course, have obtained even if mixed acts were not permitted. 
It is a remarkable mathematical fact (not to be proved in this book) 
that, permitting mixed acts, equality always obtains; so the special 
symbol L_* is superfluous here. 

The argument for the recommended choices rests on the equality of 
L* and L_*. You realize that I can take at least L* from you and that, 
if you are not careful, I may take more. On the other hand, I realize 
that you can prevent my taking more than L* from you and that, if 
I am not careful, I may get less. This suggests to many that a pair of 
intelligent players, each respecting the intelligence of the other, will 
each adopt one of the recommended acts. 


4 Parallelism and contrast with the minimax theories 


Some formal parallelism between the minimax theories of decision 
and the theory of zero-sum two-person games is evident, but the paral- 
lelism is much more complete than may appear at first sight. The mix- 
tures g are without counterpart in the two minimax theories of deci- 
sion, and the appearance of g in (3.1) at the place where i appears in 
(9.5.1) may seem to mar the parallelism between these two equations. 
But, letting 


(1)    $L(f; i) =_{Df} \sum_r L(r; i)\,\phi(r),$


in the game theory (in close parallelism with the decision theories), 


(2)    $L(f; g) = \sum_i L(f; i)\,\gamma(i) \le \max_i L(f; i),$

and 


(3)    $\max_g L(f; g) = \max_i L(f; i).$


Therefore (3.1) is equivalent to 

(4)    $\max_i L(f'; i) = \min_f \max_i L(f; i) = L^*.$
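
A numerical check of (3) may make the point concrete: against a fixed f, no mixture g takes more from you than the best single value of i. This is a sketch under stated assumptions; the table, the particular f, the grid step of 1/10, and all function names are illustrative choices, not from the text.

```python
from fractions import Fraction
from itertools import product

# Illustrative table (an assumption): L[r][i] is the amount you pay me
# when you choose r and I choose i.
L = [
    [0, 1, -1],
    [-1, 0, 1],
    [1, -1, 0],
]

def payment_vs_pure(L, phi, i):
    """L(f; i): the sum over r of L(r; i) * phi(r)."""
    return sum(L[r][i] * phi[r] for r in range(len(phi)))

def payment(L, phi, gamma):
    """L(f; g): the sum over i of L(f; i) * gamma(i)."""
    return sum(payment_vs_pure(L, phi, i) * gamma[i] for i in range(len(gamma)))

def simplex_grid(k, steps):
    """All probability vectors of length k with entries that are multiples of 1/steps."""
    for counts in product(range(steps + 1), repeat=k):
        if sum(counts) == steps:
            yield [Fraction(c, steps) for c in counts]

phi = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
worst_pure = max(payment_vs_pure(L, phi, i) for i in range(3))
worst_mixed_on_grid = max(payment(L, phi, g) for g in simplex_grid(3, 10))
print(worst_pure, worst_mixed_on_grid)   # 1/4 1/4
```

The grid search is of course no proof; it only exhibits, for one f, the fact that a weighted average of the L(f; i) cannot exceed the largest of them.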


Thus from the point of view of the minimax theories of decision the 
g’s represent no material innovation and are at worst useless baggage. 
Actually, though of little if any relevance in the interpretation of the 
minimax theories, the g’s constitute a useful mathematical device. 
Their usefulness has in fact been illustrated in working out the second 
example in § 9.6 and will be systematically demonstrated in the next 
chapter, along with the usefulness of the apparently irrelevant "maxi- 
min" problem posed by (3.2) and of the fact that L_* = L*. 

Some remarks on the possibility of interpreting the g’s in the minimax 
theories are postponed to the end of this section. 

In the game theory, L may be any function whatsoever of its argu- 
ments r and i, but, in the decision theories, L is subject to the condition 
that, for every i, 


(5)    $\min_r L(r; i) = 0,$


where L(r; i) is of course to be interpreted as $L(f_r; i)$. Here is the only 
mathematical difference between the game theory and the decision 
theories, the former being mathematically slightly more general than 
the latter. 

Though the mathematical differences are negligible, the intellectual 
difference between the situations leading to the game theory on the 
one hand and to the decision theories on the other is great. Serious 
misunderstandings of the (objectivistic) minimax theory have often re- 
sulted from identifying it with the game theory. Among other things, 
loss is then confounded with negative income, and the misconception 
that the (objectivistic) minimax rule is ultrapessimistic is created. I 
have even heard it stated on this account that the minimax rule amounts 
to the assumption that nature is malevolently opposed to the interests 
of the deciding person. 

Though mathematical convenience seems to be the basic reason for 
introducing the g’s in the minimax theories, it is tempting to ask whether 
the g’s have also some natural interpretation in those theories. At the 
moment, I do not see a convincing interpretation in either theory, but 
completeness demands an account of an interpretation suggested by 
Wald for his version of the objectivistic theory, especially since this 
interpretation influenced some of Wald’s most widely used terminology. 

The objectivistic problem of deciding on an act in ignorance of which 
partition element $B_i$ obtains, the $P(B_i)$ being regarded as meaningless, 
suggests a new problem that may perhaps also be called objectivistic. 
The new problem arises on postulating that $P(B_i)$ is meaningful but 
utterly unknown, that is, $P(B_i) = \gamma(i)$, where the $\gamma(i)$'s are the com- 
ponents of a g here interpreted as the a priori distribution unknown to 
the deciding person. 

Since for Wald "loss" was synonymous with "negative expected in- 
come," he naturally calculated the loss of the new problem thus: 


(6)    $L(f; g) = -E(f \mid g)$
       $= -\sum_i E(f \mid B_i)\,P(B_i)$
       $= \sum_i L(f; i)\,\gamma(i),$


arriving thus at the very function suggested by the game theory. In 
Wald’s version of the theory, the new problem therefore amounts to 
the formal introduction of the g’s in connection with the old one, which 
neatly fulfills the reasonable expectation that there should be no ma- 
terial difference between regarding $P(B_i)$ as meaningless and regarding 
it as meaningful but utterly unknown. 

The suggested interpretation of a g as an unknown (or, to mirror 
Wald more faithfully, fictitious) a priori distribution does not work, 
however, if the loss function of the new problem is defined by (9.4.1), 
for the new function, here written $\tilde{L}(f; g)$, is not then generally the 
same as the function L(f; g) suggested by the game theory; thus 


(7)    $\tilde{L}(f; g) = \max_{f'} E(f' - f \mid g)$
       $= \max_{f'} \sum_i E(f' - f \mid B_i)\,\gamma(i)$
       $= \max_{f'} \sum_i \{L(f; i) - L(f'; i)\}\,\gamma(i)$
       $= L(f; g) - \min_{f'} L(f'; g)$
       $\le L(f; g),$


equality holding for a typical g (i.e., a g such that $\gamma(i) > 0$ for every i) 
only in the altogether trivial situation that F is dominated by one of 
its elements. 

Does this mean that, contrary to expectation, there is a material dif- 
ference between the new problem with loss $\tilde{L}$ and the old one? The fol- 
lowing exercises show that it does not. 


Exercises 

1. $\max_g \tilde{L}(f; g) = \max_i L(f; i)$. 

2. $\min_f \max_g \tilde{L}(f; g) = L^*$. 

3. $\max_g \tilde{L}(f; g) = L^*$, if and only if $\max_i L(f; i) = L^*$. 


CHAPTER 12 


The Mathematics 


of Minimax Problems 


1 Introduction 


Since the two different minimax decision theories and the theory of 
zero-sum two-person games have a common mathematical core, it will 
be worth while to digress for a chapter even at the expense of some 
repetition, to discuss this common core mathematically, that is, vir- 
tually without reference to its various possible interpretations. The 
discussion will have to be drastically confined relative to the large body 
of relevant literature, but the reader who wishes to pursue the subject 
much further will find [B18], [V4], [W3], and [M3] to be key references. 


2 Abstract games 


To begin with a very general situation, which will later be specialized 
to the one of main interest, let f and g denote generic elements of any 
two abstract sets, and let L(f; g) be the value of an essentially arbitrary 
real-valued function. It will, however, be assumed for simplicity that 
for every f’ and g’ the quantities 


(1)    $\max_g L(f'; g), \qquad \min_f L(f; g'),$
       $L^* =_{Df} \min_f \max_g L(f; g), \qquad L_* =_{Df} \max_g \min_f L(f; g)$


exist. To say that a maximum, for example, exists is not only to say 
that the function in question is bounded from above, but also that the 
maximum value is actually attained for at least one value of the argu- 
ment. For want of a more neutral term, call the function L(f; g) an 
abstract game. 

An f' is called minimax, if and only if 


(2)    $\max_g L(f'; g) = L^*;$

and a g' is called maximin, if and only if 

(3)    $\min_f L(f; g') = L_*.$

The existence of minimax and maximin values of the variables is im- 
plicit in (1). It is an easy exercise to show that f' is minimax, if and 
only if 


(4)    $L(f'; g) \le L^*$


for every g. 
The corresponding characterization of maximin g's as those such 
that 


(5)    $L(f; g') \ge L_*$


for every f could similarly be shown. But the symmetry of the situa- 
tion is such that it would be superfluous to derive this characterization 
of a maximin explicitly. Indeed, every theorem, or general conclusion, 
about L(f; g) obviously has a dual, which arises on applying the theo- 
rem to the new abstract game L(g; f) with $L(g; f) = -L(f; g)$. This 
is typical of what is known in mathematics as a duality principle. Hence- 
forth the duals of demonstrated conclusions, even when not explicitly 
stated, will be as freely used as the demonstrated conclusions them- 
selves. Some conclusions are of course self-dual. Incidentally, another 
example of a duality principle was used in § 5.4, and a very important 
one was pointed out in connection with Boolean algebra in § 2.4. 

An argument showing that $L_* \le L^*$ was given in connection with 
the theory of games. More formally, if f' and g' are, respectively, mini- 
max and maximin, then from (4) and (5) 


(6)    $L^* \ge L(f'; g') \ge L_*.$


It is possible, indeed typical, that $L_* < L^*$. Suppose, for example, 
that f and g are variables that take only two values and that L(f; g) 
is described by Table 1. Here, as the reader should verify, both f's 
and both g's are minimax and maximin, respectively, and L* = 1, 
L_* = 0. 


TABLE 1. L(f; g) 

             g = 1   g = 2 
    f = 1      0       1 
    f = 2      1       0 
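
The verification asked of the reader can also be done mechanically. The sketch below, with function names invented for the illustration, computes L* and L_* for a finite abstract game given as a table and lists the minimax and maximin values; indices are counted from 0, so that f = 1, 2 of Table 1 appear as 0, 1.

```python
# Table 1 from the text: rows are f = 1, 2; columns are g = 1, 2.
TABLE_1 = [
    [0, 1],
    [1, 0],
]

def upper_value(L):
    """L*: the minimum over f of the maximum over g of L(f; g)."""
    return min(max(row) for row in L)

def lower_value(L):
    """L_*: the maximum over g of the minimum over f of L(f; g)."""
    return max(min(col) for col in zip(*L))

def minimax_rows(L):
    """Indices f' (counted from 0) whose maximum over g equals L*."""
    star = upper_value(L)
    return [f for f, row in enumerate(L) if max(row) == star]

def maximin_columns(L):
    """Indices g' (counted from 0) whose minimum over f equals L_*."""
    star = lower_value(L)
    return [g for g, col in enumerate(zip(*L)) if min(col) == star]

print(upper_value(TABLE_1), lower_value(TABLE_1))        # 1 0
print(minimax_rows(TABLE_1), maximin_columns(TABLE_1))   # [0, 1] [0, 1]
```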

The following theorem is frequently applicable to the identification 
of minimax and maximin values of f and g, and of L* and L_*. 


THEOREM 1  If f', g', and the number C are such that $L(f'; g) \le C 
\le L(f; g')$ for every f and g; then $L^* = L_* = C = L(f'; g')$, f' is mini- 
max, and g' is maximin. 


PROOF. First, $C \ge L^*$, because 


(7)    $C \ge \max_g L(f'; g) \ge \min_f \max_g L(f; g) = L^*;$


and, dually, $C \le L_*$. But $L_* \le L^*$; so $C \le L_* \le L^* \le C$, that is, 
$L^* = L_* = C$. Now (4) and (5) apply. ◆ 


COROLLARY 1  If f' and g' are such that $L(f'; g) \le L(f; g')$ for every 
f and g; then f' and g' are, respectively, minimax and maximin, and 
$L^* = L_* = L(f'; g')$. 


3 Bilinear games 


If one stumbles somehow onto a pair f', g' satisfying the hypothesis 
of Corollary 2.1, then he has discovered a minimax, a maximin, and 
the values (in this case equal to each other) of L* and L_*. But that 
possibility of discovery does not exist unless L* = L_*, which at the 
level of generality of the last section is unusual. Almost all real inter- 
est, however, centers on a very special class of abstract games, here to 
be called bilinear games, for which it is demonstrable that L* is in- 
variably equal to L_*. 

The definition of bilinear games involves several steps. First, con- 
sider an abstract game, L(r; i), based on a pair of variables, r and i. 
The two variables are here assumed for simplicity to have only a finite 
number of possible values, an assumption that can, and for statistics 
must, be considerably relaxed. Next, let f and g be non-negative func- 
tions of r and i, respectively, arbitrary except for the constraint that 


(1)    $\sum_r f(r) = \sum_i g(i) = 1,$


in short, probability measures on the r's and i's, respectively. Finally, 
the bilinear game L(f; g) is defined thus: 


(2)    $L(f; g) =_{Df} \sum_{r, i} L(r; i)\,f(r)\,g(i).$


It is important to recognize that the duality principle continues to 
hold, that is, if L(f; g) is a bilinear game, then $L(g; f) = -L(f; g)$ is 
also one. 

In terms of the auxiliary functions 


(3)    $L(f; i) =_{Df} \sum_r L(r; i)\,f(r), \qquad L(r; g) =_{Df} \sum_i L(r; i)\,g(i),$


the following equalities and inequalities can easily be verified by the 
reader. 


(4)    $\max_g L(f; g) = \max_i L(f; i), \qquad \min_f L(f; g) = \min_r L(r; g).$


(5)    $\min_r \max_i L(r; i) \ge \min_f \max_i L(f; i) = L^* \ge L_*$
       $= \max_g \min_r L(r; g) \ge \max_i \min_r L(r; i).$


But more can be said in connection with (5), for it has been shown by 
von Neumann [V3] that for the special class of functions now under 
discussion L* is actually equal to L_*. This important equality cannot 
conveniently be proved here, but the interested reader can refer to the 
relatively simple proof given by von Neumann and Morgenstern in 
Section 17.6 of [V4] (reading first, if necessary, the introduction to the 
mathematics of convex sets that constitutes Chapter 16 of that book) 
or to the version of it presented in [B18]. 
In the light of the equality of L* and L_*, (5) becomes 


(6)    $\min_r \max_i L(r; i) \ge \min_f \max_i L(f; i) = L^*$
       $= \max_g \min_r L(r; g) \ge \max_i \min_r L(r; i).$


In view of (4) and (6), Theorem 2.1 can be much improved upon for 
bilinear games: 


THEOREM 1  For bilinear games, the following three conditions on 
f', g', and C are equivalent: 

1. f' minimax, g' maximin, and L* = C. 

2. $L(f'; g) \le C \le L(f; g')$ for every f and g. 

3. $L(f'; i) \le C \le L(r; g')$ for every i and r. 


PROOF. Condition 2 implies 1, by Theorem 2.1; 1 implies 3 by (6); 
and 3 implies 2 by (4). ◆ 


COROLLARY 1  A necessary and sufficient condition that f be mini- 
max is that, for some g, $L(f; i) \le L(r; g)$ for every r and i. Under 
that condition L* = L(f; g), and g is maximin. 
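
The condition of Corollary 1 involves only finitely many inequalities and is therefore easy to check for a guessed pair f, g. The following sketch does so; the 2-by-3 table and the function name `verify_guess` are assumptions made for the illustration.

```python
from fractions import Fraction

def verify_guess(L, f, g):
    """Return L* if L(f; i) <= L(r; g) for every r and i; otherwise None.

    L[r][i] is the payment table; f and g are probability vectors.
    """
    n_r, n_i = len(L), len(L[0])
    f_vs_pure = [sum(L[r][i] * f[r] for r in range(n_r)) for i in range(n_i)]
    pure_vs_g = [sum(L[r][i] * g[i] for i in range(n_i)) for r in range(n_r)]
    if max(f_vs_pure) <= min(pure_vs_g):
        # Under the condition, L* = L(f; g).
        return sum(f_vs_pure[i] * g[i] for i in range(n_i))
    return None

# Illustrative table (an assumption): two values of r, three of i.
L = [
    [0, 2, 1],
    [2, 0, 1],
]
half = Fraction(1, 2)
print(verify_guess(L, [half, half], [half, half, 0]))   # 1
print(verify_guess(L, [1, 0], [half, half, 0]))         # None
```

A result of `None` shows only that the guessed pair fails the test together; it does not by itself say which of the two guesses is at fault.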


Corollary 1 seems an especially appropriate expression of Theorem 1 
in connection with the minimax decision theories, where the g’s are, after 
all, not really of interest in themselves. Theorem 1, and equivalently 
Corollary 1, are of great practical value. To be sure, there are algo- 
rithms, or rules (given by Shapley and Snow in [S12]), by which L* 
and all minimax values of f can in principle be computed, but these al- 
gorithms are so awkward to apply that in practice one generally guesses 
one or more minimax f’s, and also a maximin g, on the basis of some 
clues, verifying the guess and evaluating L* by Corollary 1. To finish 
the job, one then finds, if one can, an argument to show that the mini- 
max f’s thus discovered are all there are. This rather imperfect pro- 
cedure is especially important, since it can relatively easily be extended 
to many situations in which r and i are not confined to finite ranges, as 
does not seem to be true of the algorithms. 

As was mentioned in § 10.3 and as the examples that have been given 
illustrate, if f is minimax, then L(f; i) is in practice often actually equal 
to L* for all, or at least many, values of i. Insight into that phenome- 
non is given by the following theorem. 


THEOREM 2  If i is such that there exists a maximin g for which 
g(i) > 0, then L(f; i) = L* for every minimax f. 


PROOF. $L(f; i) \le L^*$, because f is minimax. Therefore L(f; g), be- 
ing a weighted average of the L(f; i)'s, is at most L*; and it is actually 
less, if any term with positive weight is not equal to L*. But $L(f; g) 
\ge L^*$, because g is maximin. ◆ 


It can happen, and in statistical practice it often does happen, that 
every i satisfies the hypothesis of Theorem 2, in which case L(f; i) = 
L* for every i and every minimax f. 

Theorem 2 often provides a basis for guessing a minimax f, a maximin 
g, and the value of L*, which can then be checked by application of 
Corollary 1. To take a simple example, suppose that there are n values 
of r, and n of i. There may be some reason to conjecture that each i 
is used by some maximin g, that is, that each i satisfies the hypothesis 
of Theorem 2. If the conjecture is in fact true, then f(r) and L* satisfy 
the system of equations 


(7)    $\sum_r 1 \cdot f(r) + 0 \cdot L^* = 1,$
       $\sum_r L(r; i)\,f(r) - 1 \cdot L^* = 0.$


Typically, (7) as a system of n + 1 linear equations in n + 1 variables 
will have exactly one solution (f(r), L*). This solution, if the conjec- 
ture is valid, will actually consist of the components of a minimax f 
(in this case the only one) and the value of L*. But the conjecture is 
not yet confirmed. In particular, if any f(r) in the solution of (7) is 
negative, it is contradicted; if not, the investigation can proceed. The 
candidates for maximin values of g are now, by the dual of Theorem 2, 
among the solutions of the system 


(8)    $\sum_i 1 \cdot g(i) + 0 \cdot L^* = 1,$
       $\sum_i L(r; i)\,g(i) - 1 \cdot L^* = 0,$


where r is confined to the values for which f(r) > 0. To consider only 
the simplest and most typical case, suppose f(r) > 0 for every r. Re- 
garding L* as known, (8) consists of n + 1 equations for n variables, 
which at first sight might be expected generally to have no solution. 
To put the matter differently, if one forgets for the moment that L* 
has been determined by (7), it might seem possible that (8) could lead 
to a different value, say L*'. But, using the latter part of (8) and then 
the first part of (7), it is seen that 


(9)    $\sum_{r, i} L(r; i)\,f(r)\,g(i) = \sum_r f(r)\,L^{*\prime} = L^{*\prime},$


and dually the double sum equals L*; so discrepancy between L* and 
L*' is not among the real snags in the tentative program, irrespective 
of the number of r's participating in (8). Finally, if (8) leads to even 
one set of positive g(i)'s, it follows from Corollary 1 that the f and L* 
derived from (7) are the unique minimax and the true value of L*, re- 
spectively. 
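
The tentative program built on (7) and (8) can be carried through mechanically in small cases. The sketch below solves both systems exactly with rational arithmetic, treating the value in (8) as a fresh unknown so that its agreement with L* can be observed rather than assumed. The 3-by-3 table, the elimination routine, and the function names are assumptions for the illustration; the routine supposes that the system has exactly one solution, which is the typical case only.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for col in range(n):
        # Raises StopIteration if no pivot exists, i.e. if A is singular.
        pivot = next(k for k in range(col, n) if M[k][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for k in range(n):
            if k != col and M[k][col] != 0:
                factor = M[k][col]
                M[k] = [x - factor * y for x, y in zip(M[k], M[col])]
    return [M[k][n] for k in range(n)]

def equalizing_f(L):
    """System (7): sum_r f(r) = 1, and sum_r L(r; i) f(r) - L* = 0 for each i."""
    n = len(L)
    A = [[1] * n + [0]]
    A += [[L[r][i] for r in range(n)] + [-1] for i in range(n)]
    *f, value = solve_linear(A, [1] + [0] * n)
    return f, value

def equalizing_g(L):
    """System (8), with its value unknown: the same computation on the transposed table."""
    return equalizing_f([list(col) for col in zip(*L)])

# Illustrative table (an assumption) with n = 3 values of r and of i.
L = [
    [0, 1, 2],
    [3, 0, 1],
    [1, 2, 0],
]
f, value_from_7 = equalizing_f(L)
g, value_from_8 = equalizing_g(L)
print(f, value_from_7)
print(g, value_from_8)
```

Here every f(r) and every g(i) in the solutions is positive and the two values coincide, so, by Corollary 1, the f found is minimax and the common value is L* for this table.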

The converse of Theorem 2 has been proved by Bohnenblust, Karlin, 
and Shapley in [B19], though their proof cannot be reproduced here. 
As is pointed out by these authors, the converse does not extend at all 
readily to situations involving infinite ranges of r and i. Theorem 2 
and its converse can be summarized thus: 


THEOREM 3  There exists a maximin g for which g(i) > 0, if and 
only if L(f; i) = L* for every minimax f. 


4 An example of a bilinear game 


It is now convenient to discuss a certain example, or rather a class of 
examples, of bilinear games, namely those in which i takes only two 
values, say 1 and 2. Two preliminary remarks will help to orient the 
discussion. First, bilinear games in which i takes only one value are 
devoid of interest, for the minimax problem in that case is simply a 
problem of finding an ordinary minimum. Second, the discussion of bi- 
linear games in which i takes only two values includes, in effect, be- 
cause of the duality principle, the discussion of those in which r takes 
only two values. 

If i takes only the two values 1 and 2, the values g = {g(1), g(2)} 
can be represented graphically by points on an interval, as illustrated 
at the foot of Figure 1. For every r, L(r; g) is linear as a function of 
g, as is L(f; g) for every f. It is, of course, just because the L(f; g) of a 
bilinear game is linear in this sense and its dual that I use the term "bi- 
linear." In Figure 1 the five slanting solid lines represent the five linear 
functions L(r; g) of a bilinear game in which r (for illustration) takes 
five values and i takes two. The dashed lines represent two values of f, 
each of which has for simplicity been so chosen as to use, or mix, only 
two values of r. 


Figure 1 

As may be verified by inspection, the particular bilinear game rep- 
resented by Figure 1 has the special property that $\min_r L(r; i) = 0$ for 
each i, which is the distinguishing property of those bilinear games that 
arise in connection with the minimax decision theories described in 
Chapters 9 and 10. 

Figure 1 bears a more than accidental resemblance to Figure 7.2.1. 
In particular, the concave function 


(1)    $\min_r L(r; g)$


marked by heavy line segments in Figure 1 is closely analogous to the 
convex function so marked in Figure 7.2.1. The particular g empha- 
sized by Figure 1 is that for which the function (1) attains its maximum 
value, which according to (3.6) is L*. This g is therefore the unique 
maximin. It has been shown quite generally in [B19] that bilinear games 
with more than one minimax or maximin are, in a sense, unusual; 
Figure 1 makes it graphically clear that the special bilinear games now 
under consideration do usually have a unique maximin, because there 
is more than one maximin only in case (1) happens to have a horizontal 
segment. 

What are the minimax f’s for the bilinear game represented by Figure 
1? According to the dual of Theorem 3.2, an r cannot be used in the 
formation of a minimax f unless L(r; g) = L* for the (in this case 
unique) maximin g. That consideration eliminates all but two of the 
r’s from consideration, and it is graphically clear that this will usually 
be the case for bilinear games in which i takes only two values. Theo- 
rem 3.2 itself, applied to the particular game under discussion, shows 
that the graph of L(f; g) as a function of g must be horizontal for any 
minimax f. The two preceding conditions together eliminate all values 
of f except the one corresponding to the horizontal dashed line in Fig- 
ure 1; and that f is indeed minimax, because L(f; i) = L* for both 
values of i. 

To specialize still further, suppose that r as well as i takes only two 
values. Such a game can, of course, be represented graphically in the 
spirit of Figure 1. Several qualitatively different situations can occur, 
which might, for example, be classified by the relation of the two linear 
functions L(r; g) to each other. The reader should graph and consider 
many or all of these possibilities for himself. The only one treated 
here will be that in which the two functions cross each other at an in- 
terior g, with one function sloping up and the other down. It is graphi- 
cally clear that there will then be a unique minimax and a unique maxi- 
min, as will now be shown analytically. 

The condition postulated can be expressed without loss of generality 
thus: 


(2)    $L(1; 2) > L(1; 1), \qquad L(2; 1) > L(2; 2),$
       $L(2; 1) > L(1; 1), \qquad L(1; 2) > L(2; 2).$

Or, more mnemonically, 

(3)    $L(1; 2),\ L(2; 1) > L(1; 1),\ L(2; 2).$


It is conjectured, in this case on graphical grounds, that the program 
outlined in connection with (3.7-8) applies, and the reader can indeed 
verify that that program leads to the conclusion 


(4)    $L^* = \{L(1; 2)\,L(2; 1) - L(1; 1)\,L(2; 2)\}/\Delta,$

where 

(5)    $\Delta = L(1; 2) + L(2; 1) - L(1; 1) - L(2; 2);$


and that the unique minimax f and maximin g are 


(6)    $f(1) = [L(2; 1) - L(2; 2)]/\Delta, \qquad f(2) = [L(1; 2) - L(1; 1)]/\Delta,$

(7)    $g(1) = [L(1; 2) - L(2; 2)]/\Delta, \qquad g(2) = [L(2; 1) - L(1; 1)]/\Delta.$


If the game arises from an application of the minimax decision theory, 
(3) almost always applies. More precisely, in this case, except possibly 
for the order of numbering, 


(8)    $L(1; 1) = L(2; 2) = 0$ and $L(1; 2),\ L(2; 1) \ge 0;$


so, if only the inequalities in (8) are both strict, (3) applies. Then 
(4-7) specialize to 


(9)    $L^* = L(1; 2)\,L(2; 1)/\Delta,$

where 

(10)   $\Delta = L(1; 2) + L(2; 1);$

(11)   $f(1) = L(2; 1)/\Delta, \qquad f(2) = L(1; 2)/\Delta,$

(12)   $g(1) = L(1; 2)/\Delta, \qquad g(2) = L(2; 1)/\Delta.$
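
The closed formulas for the game with two values each of r and i are easily put to work. The following sketch, in which the function name and the sample numbers are assumptions, evaluates (4) through (7) after confirming condition (3); with L(1; 1) = L(2; 2) = 0 it reproduces the special case (9) through (12).

```python
from fractions import Fraction

def solve_two_by_two(L11, L12, L21, L22):
    """Evaluate (4)-(7); the formulas are claimed only when condition (3) holds."""
    if not min(L12, L21) > max(L11, L22):
        raise ValueError("condition (3) fails; the formulas do not apply")
    delta = Fraction(L12 + L21 - L11 - L22)
    value = (L12 * L21 - L11 * L22) / delta
    f = ((L21 - L22) / delta, (L12 - L11) / delta)
    g = ((L12 - L22) / delta, (L21 - L11) / delta)
    return value, f, g

# A decision-theoretic instance: L(1; 1) = L(2; 2) = 0, L(1; 2) = 3, L(2; 1) = 1.
value, f, g = solve_two_by_two(0, 3, 1, 0)
print(value, f, g)   # L* = 3/4, f = (1/4, 3/4), g = (3/4, 1/4)
```

As a check on the instance, L(f; 1) = f(2)L(2; 1) = 3/4 and L(f; 2) = f(1)L(1; 2) = 3/4, so that f equalizes the two values of i, as Theorem 3.2 requires.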


5 Bilinear games exhibiting symmetry 


Mathematically the solution of a bilinear game is often simplified by 
considerations of symmetry. For statistical applications, the implica- 
tions of symmetry for bilinear games are of fundamental importance 
in so far as they represent a counterpart in the minimax theory of the 
disreputable but irrepressible principle of insufficient reason. This sec- 
tion discusses these implications in an elementary, but formal, way. 
It can be skimmed over or skipped outright without much detriment 
to the understanding of later sections. 

Any discussion of symmetry involves, at least implicitly, the branch 
of mathematics known as the theory of groups. Though what is to 
be said here about games exhibiting symmetry is intended to be clear 
without prior knowledge of the theory of groups, it may be mentioned 
that introductions to that subject are to be found in many places, for 
example in [B14]. 

It can, and in practice often does, happen that a bilinear game has 
some symmetry.† This means that there are permutations, here sym- 
bolized by T, T', etc., of the values of r among themselves and the values 
of i among themselves such that 


(1)    $L(Tr; Ti) = L(r; i)$


for every r and i, where, of course, Tr and Ti are the values into which 
T carries r and i respectively. Permutations satisfying (1) are said to 
leave the game invariant, or to belong to the group (of symmetries) of the 
game. The permutation U that leaves every r and every i fixed must 
be counted among the permutations in the group of the game, but the 
game has no symmetry (worthy of the name) unless there are other 
permutations besides U in its group. 

An example of a game with high symmetry is the game implicit in 
the second example of § 9.6, for, to any permutation whatsoever of the 
six i's in that game among themselves, there is a corresponding permu- 
tation of the r's such that the two permutations taken together leave 
the game invariant. It was, of course, the exploitation of symmetry 
that made the treatment of that example relatively simple. 

Returning to bilinear games in general, if T and T' are in the group 
of the game, then the product TT' defined by the condition that 


(2)    $(TT')r =_{Df} T(T'r), \qquad (TT')i =_{Df} T(T'i)$

is obviously also a permutation in the group of the game. This multi- 
plication of permutations somewhat resembles the ordinary multipli- 
cation of numbers. In particular, (TT')T'' is evidently the same as 
T(T'T''), though it is not necessarily true that TT' = T'T. 

† This concept must not be confused with that of "symmetrical games," which are 
symmetrical in the sense that the equation $L(r; i) = -L(i; r)$ is meaningful and true 
for every r and i. 

Relative to this multiplication the permutation U plays the role of 
the unit, or number 1, in arithmetic, for it is obvious that TU = UT 
= T for any permutation T. 

For every permutation T, there is evidently a permutation $T^{-1}$, and 
one only, that undoes T, that is, one such that $T^{-1}T = U$. It is easy 
to see also that $TT^{-1} = U$ and that, if T is in the group of the game, 
$T^{-1}$ is too. The notation $T^{-1}$ is of course motivated by the considera- 
tion that, relative to the multiplication of permutations, $T^{-1}$ plays the 
role of the reciprocal of T. 

It will be adopted as a definition that Tf and Tg are the functions 
such that $Tf(r) = f(T^{-1}r)$ and $Tg(i) = g(T^{-1}i)$ for every permutation 
T and for every r and i. The intervention of $T^{-1}$ in this definition 
may at first seem arbitrary, but it is motivated by the following con- 
siderations. First, if f is, for example, the function such that $f(r_0) = 1$ 
and f(r) = 0 for $r \ne r_0$, then Tf should be such that $Tf(Tr_0) = 1$ and 
Tf(r) = 0 for $r \ne Tr_0$. Second, S(Tf) should be (ST)f rather than 
(TS)f. The definition having been adopted, L(Tf; Tg) can be calcu- 
lated thus: 


(3)    $L(Tf; Tg) = \sum_{r, i} L(r; i)\,f(T^{-1}r)\,g(T^{-1}i)$
       $= \sum_{r, i} L(Tr; Ti)\,f(T^{-1}Tr)\,g(T^{-1}Ti)$
       $= \sum_{r, i} L(Tr; Ti)\,f(r)\,g(i),$


where the basic fact is exploited that, if r, i runs once through all pairs 
of values, then Tr, Ti also does so. It follows from (1) and (3) that, if 
T is in the group of the game, then 


(4)    $L(Tf; Tg) = L(f; g).$
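
The definition of Tf and the invariance (4) can be tried on a small game. In the sketch below the table is that of "Stone, paper, scissors" and T is the cyclic shift applied to the r's and the i's alike; both choices, and all the function names, are assumptions made for the illustration.

```python
from fractions import Fraction

L = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]   # assumed payment table
T = [1, 2, 0]   # T carries 0 to 1, 1 to 2, 2 to 0, for both r and i

def leaves_invariant(L, perm_r, perm_i):
    """Condition (1): L(Tr; Ti) = L(r; i) for every r and i."""
    return all(L[perm_r[r]][perm_i[i]] == L[r][i]
               for r in range(len(perm_r)) for i in range(len(perm_i)))

def act(perm, h):
    """Th, where Th(x) = h(T^{-1} x); equivalently Th(Tx) = h(x)."""
    out = [None] * len(h)
    for x, tx in enumerate(perm):
        out[tx] = h[x]
    return out

def payment(L, f, g):
    """L(f; g): the sum over r and i of L(r; i) * f(r) * g(i)."""
    return sum(L[r][i] * f[r] * g[i] for r in range(len(f)) for i in range(len(g)))

f = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
g = [Fraction(1, 5), Fraction(1, 5), Fraction(3, 5)]
print(leaves_invariant(L, T, T))                              # True
print(payment(L, act(T, f), act(T, g)) == payment(L, f, g))   # True
```

Note that `act` moves the probability that f attaches to r onto Tr, which is the first of the two considerations offered in the text for the appearance of the inverse in the definition.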


An f (g) is called invariant under the group of the game, if and only if 
Tf = f (Tg = g) for every T in the group. There is a natural way to 
construct from any f an $\bar{f}$ invariant under the group, and dually for g. 
Namely, let 


(5)    $\bar{f} =_{Df} \frac{1}{n} \sum_T Tf, \qquad \bar{g} =_{Df} \frac{1}{n} \sum_T Tg,$


where (here and throughout this section) n is the number of elements 
in the group and the summation is over all elements of the group. The 
definition (5) accomplishes its objective, because 


(6)    $\sum_r \bar{f}(r) = \frac{1}{n} \sum_T \sum_r Tf(r) = \frac{1}{n} \sum_T 1 = 1,$

and 

(7)    $T'\bar{f}(r) = \bar{f}({T'}^{-1}r)$
       $= \frac{1}{n} \sum_T f(T^{-1}{T'}^{-1}r)$
       $= \frac{1}{n} \sum_T (T'T)f(r) = \bar{f}(r)$


for every r and for every T' in the group. In (7) use is made of the 
easily established facts that $T^{-1}{T'}^{-1} = (T'T)^{-1}$ and that as T runs 
once through the group so does T'T. The justification of $\bar{g}$ is, of course, 
dual to that of $\bar{f}$. It is noteworthy that $\bar{f} = f$, if and only if f is invariant 
under the group of the game. 
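
The averaging construction (5) is short to program. The sketch below represents only the action of the group on the r's, which is all that the construction for f requires; the group (the three cyclic shifts of three values), the particular f, and the function names are assumptions for the illustration.

```python
from fractions import Fraction

def compose(S, T):
    """The product ST, defined by (ST)x = S(Tx); a permutation T is a list with T[x] = Tx."""
    return [S[T[x]] for x in range(len(T))]

def generate_group(generators):
    """All products of the given permutations (the finite group they generate)."""
    identity = list(range(len(generators[0])))
    group, frontier = [identity], [identity]
    while frontier:
        fresh = []
        for S in frontier:
            for G in generators:
                P = compose(G, S)
                if P not in group:
                    group.append(P)
                    fresh.append(P)
        frontier = fresh
    return group

def act(T, f):
    """Tf, where Tf(r) = f(T^{-1} r); equivalently Tf(Tr) = f(r)."""
    out = [None] * len(f)
    for r, Tr in enumerate(T):
        out[Tr] = f[r]
    return out

def average(group, f):
    """f-bar = (1/n) times the sum of Tf over all T in the group, as in (5)."""
    n = len(group)
    images = [act(T, f) for T in group]
    return [sum(image[r] for image in images) / n for r in range(len(f))]

group = generate_group([[1, 2, 0]])          # assumed group of order 3
f = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
f_bar = average(group, f)
print(len(group), f_bar)                              # n = 3; every component is 1/3
print(all(act(T, f_bar) == f_bar for T in group))     # True: f-bar is invariant
print(average(group, f_bar) == f_bar)                 # True: an invariant f is its own average
```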


Suppose R (I) is a set of the r's (i's). Then, by definition, $r \in TR$ 
($i \in TI$), if and only if $T^{-1}r \in R$ ($T^{-1}i \in I$); and the set R (I) is invariant 
under the group of the game, if and only if TR = R (TI = I) for every 
T in the group. 


Exercises 


1a. If R is invariant, so is ~R.

1b. If R and R' are invariant, so are $R \cap R'$ and $R \cup R'$.

1c. The vacuous set and the set of all r's are invariant.

2. For every R, let $\bar R =_{Df} \bigcup_T TR$, where T is of course confined to
the group; and, for every r, define the trajectory of r as $\overline{[r]}$,
where [r] is, as is customary, the set whose only element is r.

(a) $\bar R$ is the smallest invariant set containing R.

(b) $\bar R$ is the intersection of all invariant sets containing R.

(c) $\bar R = \bigcup_{r \in R} \overline{[r]}$.

(d) $\overline{[r]}$ is the smallest invariant set of which r is an element.

3a. If R is invariant, and $R \cap \overline{[r]} \ne 0$, then $R \supset \overline{[r]}$.
3b. If R is invariant, and $r \in R$, then $R \supset \overline{[r]}$.
3c. If $\overline{[r]} \cap \overline{[r']} \ne 0$, then $\overline{[r]} = \overline{[r']}$.
4a. The following conditions are equivalent:
α. R is invariant.
β. $\bar R = R$.
γ. For every $r \in R$, $\overline{[r]} \subset R$.
δ. R is partitioned into sets each of which is a trajectory.
4b. The following conditions are equivalent:
α. f is invariant.
β. The set of r's for which f takes any given value is invariant.
γ. f is constant on every trajectory.
5a. If T'r = r, then $(TT'T^{-1})Tr = Tr$.
5b. If {r} denotes the number of elements of the group that leave r
fixed, then {r} = {Tr}.
5c. If $\|r\|$ denotes the number of elements in $\overline{[r]}$, then $n = \{r\}\,\|r\|$.
5d. Both {r} and $\|r\|$ are divisors of n.
5e. The value of $\bar f$ everywhere on the trajectory of r is

(8)  $\|r\|^{-1} \sum_{r' \in \overline{[r]}} f(r').$

6. Note the dual of each of the preceding exercises.
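
Several of these exercises can be checked by brute force on a small case. The Python sketch below is illustrative only: the five r's, the four-element group GROUP, and the helper names trajectory and fixers are hypothetical. It takes the trajectory of r to be the set of all images Tr, and it checks the counting relation of Exercise 5c and the partition property of Exercises 3c and 4a for this one group.

```python
# Hypothetical group of order n = 4 acting on five r's (labelled 0..4).
# A group element T is stored as a tuple with T r = perm[r].
GROUP = [
    (0, 1, 2, 3, 4),  # identity
    (1, 0, 2, 3, 4),  # swaps 0 and 1
    (0, 1, 3, 2, 4),  # swaps 2 and 3
    (1, 0, 3, 2, 4),  # does both swaps
]
n = len(GROUP)


def trajectory(r):
    """All images T r of r under the group."""
    return frozenset(perm[r] for perm in GROUP)


def fixers(r):
    """Number of group elements that leave r fixed, written {r} in Exercise 5b."""
    return sum(1 for perm in GROUP if perm[r] == r)


for r in range(5):
    size = len(trajectory(r))            # ||r|| of Exercise 5c
    print(r, sorted(trajectory(r)), fixers(r), size, fixers(r) * size == n)

# Distinct trajectories partition the r's (Exercises 3c and 4a).
print(sorted(sorted(t) for t in {trajectory(r) for r in range(5)}))
```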


In the establishment of all these preliminaries, the theory of bilinear 
games has been almost lost sight of, but it is now possible to say much 
about the significance of invariant functions and sets for bilinear games. 
I begin with a theorem valued for some of its corollaries rather than 
for any charm of its own. 


THEOREM 1  If $L(f'; Tg) \le L(f''; Tg)$ for every T, then
$L(\bar f'; g) \le L(f''; \bar g)$. If in addition $L(f'; g) < L(f''; g)$, then
$L(\bar f'; g) < L(f''; \bar g)$.


PROOF.
(9)  $L(T^{-1}f'; g) = L(f'; Tg) \le L(f''; Tg).$
Therefore
(10)  $L(\bar f'; g) = \frac{1}{n} \sum_T L(T^{-1}f'; g) \le \frac{1}{n} \sum_T L(f''; Tg) = L(f''; \bar g).$


If $L(f'; g) < L(f''; g)$, then (9) is strict for T = U, and therefore (10)
is also strict. ∎

COROLLARY 1  If $L(f'; Tg) = L(f''; Tg)$ for every T, then
$L(\bar f'; g) = L(f''; \bar g)$.

COROLLARY 2  If $L(f'; g) = L(f''; g)$ for every g, then
$L(\bar f'; g) = L(f''; \bar g)$ for every g.

COROLLARY 3  $L(\bar f; g) = L(f; \bar g) = L(\bar f; \bar g)$ for every f and g.

COROLLARY 4  If f is invariant under the group of the game,
$L(f; g) = L(f; \bar g)$ for every g.


Paraphrasing some of the nomenclature of § 6.4, if $L(f'; g) \le L(f''; g)$
for every g, say that f' dominates f''; if f' dominates f'', but f'' does not
dominate f', say that f' strictly dominates f''; if f' dominates f'', and f''
dominates f', say that f' and f'' are equivalent; if f' is not strictly
dominated by any f, say that f' is admissible.


COROLLARY 5  If f' dominates, strictly dominates, or is equivalent
to f'', then $\bar f'$ dominates, strictly dominates, or is equivalent to
$\bar f''$, respectively.


COROLLARY 6  If $L(f; Tg) \le L(\bar f; Tg)$ for every T, then
$L(f; g) = L(\bar f; g)$.

COROLLARY 7  If $L(f; i) \le L(\bar f; i)$ for every $i \in I$, where I is
invariant under the group of the game, then $L(f; i) = L(\bar f; i)$ for $i \in I$.


COROLLARY 8  It is impossible that f strictly dominates $\bar f$.
THEOREM 2  $\max_g L(\bar f; g) \le \max_g L(f; g)$, equality holding, if and only
if the right-hand maximum is attained for a g invariant under the group
of the game.


PROOF.

(11)  $\max_g L(\bar f; g) = \max_g L(f; \bar g) \le \max_g L(f; g).$


The inequality in (11) follows from the fact that every $\bar g$ is a g; equality
holds, if and only if the final maximum is attained for some $\bar g$, that is,
for some invariant g. ∎


COROLLARY 9  If f is minimax, so is $\bar f$.


COROLLARY 10  There exists a minimax f invariant under the group
of the game.
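
Theorem 2 can be illustrated on a very small game. The Python sketch below is a hedged example, not part of the theory: the 2 x 2 loss table LOSS, the two-element group GROUP, and the helper names are hypothetical, and the group is assumed to act on the r's and the i's simultaneously while leaving the loss unchanged. Since L(f; g) is linear in g, the maximum over g is computed over the pure i's only.

```python
from fractions import Fraction

# Hypothetical 2 x 2 game: LOSS[r][i] is L(r; i).
LOSS = [[0, 1],
        [1, 0]]

# Assumed group of the game: each element is a pair (perm_r, perm_i),
# acting by r -> perm_r[r] and i -> perm_i[i], and leaving LOSS unchanged.
GROUP = [((0, 1), (0, 1)), ((1, 0), (1, 0))]


def loss(f, i):
    """L(f; i) for a mixed act f and a pure i."""
    return sum(f[r] * LOSS[r][i] for r in range(len(f)))


def max_loss(f):
    """max over g of L(f; g); by bilinearity it is attained at some pure i."""
    return max(loss(f, i) for i in range(len(LOSS[0])))


def symmetrize(f):
    """f-bar, assuming (Tf)(r) = f(T^{-1} r) averaged over the group."""
    n = len(GROUP)
    return [sum(f[perm_r.index(r)] for perm_r, _ in GROUP) / n
            for r in range(len(f))]


# The loss is invariant under every element of the assumed group.
assert all(LOSS[pr[r]][pi[i]] == LOSS[r][i]
           for pr, pi in GROUP for r in range(2) for i in range(2))

f = [Fraction(3, 4), Fraction(1, 4)]
f_bar = symmetrize(f)
print(max_loss(f_bar), max_loss(f))   # Theorem 2: first <= second
```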


If a game has more than one minimax f, it is tempting to suppose 
that in statistical, if not in all, applications of the theory an invariant, 
or symmetrical, minimax f would recommend itself at least as highly
as any other minimax f. This supposition, being vague, cannot be 
really proved, but certain facts tend to support it. In particular, the 
following theorem is a reassuring improvement of Corollary 10. 


THEOREM 3 There is at least one admissible, invariant, minimax f. 


PROOF. It is a direct consequence of a theorem (Theorem 2.22, p. 54,
of [W3]) of Wald's, too technical for statement or proof here, that at
least one invariant minimax f is strictly dominated by no invariant f'.
If that f were strictly dominated by any f'' (invariant or not), it would
also, according to Corollary 5, be strictly dominated by $\bar f''$, which is
impossible. Therefore f is admissible. ∎


If the bilinear game has high symmetry or, more explicitly, if the
number of trajectories into which the r's or the i's, or both, are
partitioned is small, the search for invariant minimax f's and invariant
maximin g's is relatively simple. An invariant minimax is characterized
as an invariant f' such that


(12)  $\max_g L(f'; g) = \min_f \max_g L(f; g) = L^*.$


But, since at least one invariant minimax exists, the criterion (12) is 
not changed if the minimization on its right side is confined to invari- 
ant f’s; with f so confined, the criterion remains unchanged, if both 
maximizations are confined to invariant g’s (as Corollary 3 shows). 
Thus the search for invariant minimax f’s and invariant maximin g’s 
amounts to the solution of an abstract game that arises from the origi- 
nal bilinear game by ruling out certain values of f and g, namely the 
un-invariant ones. 

This new and smaller abstract game can be exhibited as a bilinear
game thus: Let it be understood for the moment that r' ranges over
such a set of the r's that there is exactly one r' in every trajectory
$\overline{[r]}$; dually for i'. For invariant f and g,


(13)  $L(f; g) = \sum_{r, i} L(r; i) f(r) g(i)$
      $= \sum_{r', i'} \sum_{r \in \overline{[r']}} \sum_{i \in \overline{[i']}} L(r; i) f(r) g(i)$
      $= \sum_{r', i'} f(r') g(i') \sum_{r \in \overline{[r']}} \sum_{i \in \overline{[i']}} L(r; i)$
      $= \sum_{r', i'} L'(r'; i') f'(r') g'(i'),$

where

(14)  $L'(r'; i') =_{Df} \|r'\|^{-1} \|i'\|^{-1} \sum_{r \in \overline{[r']}} \sum_{i \in \overline{[i']}} L(r; i)$

and

(15)  $f'(r') =_{Df} \|r'\| f(r'); \quad g'(i') =_{Df} \|i'\| g(i').$


Finally, it is easily verified that, except for the conditions $f'(r') \ge 0$,
$g'(i') \ge 0$, and $\sum f'(r') = \sum g'(i') = 1$, the coefficients f'(r') and g'(i') are
arbitrary. The new game is therefore to all intents and purposes a
bilinear game with only as many r's and i's as there are r-trajectories
and i-trajectories, respectively, in the original game. The new game,
incidentally, may well have symmetry of its own.
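
The passage to the smaller game can be sketched in a few lines of Python. Everything specific here is hypothetical: the 3 x 2 loss table LOSS, the two-element group GROUP, and the helper trajectories. The sketch averages L(r; i) over pairs of trajectories as in (14), forms the reduced weights as in (15), and confirms the chain of equalities (13) for one invariant f and the only invariant g of this example.

```python
from fractions import Fraction

# Hypothetical game with three r's and two i's; LOSS[r][i] is L(r; i).
LOSS = [[0, 4],
        [4, 0],
        [1, 1]]

# Assumed group of the game (order 2): pairs (perm_r, perm_i).
GROUP = [((0, 1, 2), (0, 1)), ((1, 0, 2), (1, 0))]


def trajectories(perms, size):
    """Distinct trajectories of 0..size-1 under the given permutations."""
    found = {frozenset(perm[x] for perm in perms) for x in range(size)}
    return sorted(found, key=min)


r_traj = trajectories([pr for pr, _ in GROUP], 3)   # trajectories of the r's
i_traj = trajectories([pi for _, pi in GROUP], 2)   # trajectories of the i's

# Equation (14): average L(r; i) over each pair of trajectories.
reduced = [[Fraction(sum(LOSS[r][i] for r in rt for i in it),
                     len(rt) * len(it))
            for it in i_traj] for rt in r_traj]
print(reduced)

# Check (13) for one invariant f and the only invariant g.
f = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
g = [Fraction(1, 2), Fraction(1, 2)]
full = sum(LOSS[r][i] * f[r] * g[i] for r in range(3) for i in range(2))

# Equation (15): weights of the reduced game, one per trajectory.
f_red = [len(rt) * f[min(rt)] for rt in r_traj]
g_red = [len(it) * g[min(it)] for it in i_traj]
small = sum(reduced[a][b] * f_red[a] * g_red[b]
            for a in range(len(r_traj)) for b in range(len(i_traj)))
print(full, small)
```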

If there is only one r- or one i-trajectory, the new game is so simple it
scarcely deserves to be called a game. This occurs, for example, in the
second example of § 9.6, where there is only one i-trajectory. In that
situation there is only one invariant g, and it is equal at every i to the
reciprocal of the total number of i's (which is here the value of $\|i\|$
for every i). That g must therefore be an admissible maximin. The
value of L* is therefore given by


(16)  $L^* = \min_r \frac{1}{\|i\|} \sum_i L(r; i).$

The invariant minimax f's are those and only those invariant f's such
that f(r) = 0 for every r that fails to minimize the sum in (16).
Moreover, here the minimax f's (invariant or not) are all equivalent, as can
be argued thus: Any invariant minimax f is such that


(17)  $L(f; g) = L(f; \bar g) = L^*$


for every g. If any minimax f whatsoever failed to satisfy (17), it
would strictly dominate $\bar f$; but according to Corollary 8 that is
impossible. Therefore in the very special situation at hand all minimax f's
satisfy (17) and are accordingly equivalent.

It is, of course, important to extend consideration of symmetry to
bilinear games with infinite sets of r's and i's, and infinite groups of
symmetries, but the task has not yet proved straightforward. Two key
references bearing on it are [L4] and [B17].


CHAPTER 13 


Objections to 
the Minimax Rules 


1 Introduction 


I have already expressed and supported my opinion that neither the 
objectivistic nor the personalistic minimax rule can be categorically de- 
fended (§ 9.7 and § 10.3). On the other hand, certain objections have 
been leveled against the objectivistic rule (that being the well-known 
one) that seem to me to call for reinterpretation, if not outright refu- 
tation. 


2 A confusion between loss and negative income


Some objections valid against the minimax rule based on negative 
income are irrelevant to that based on loss. The notions that the mini- 
max rule is ultrapessimistic and that it can lead to the ignoring of even 
extensive evidence have already been discussed as examples of such ob- 
jections. 

Another example I would put in the same category has been suggested
by Hodges and Lehmann [H5]. In this example a person who has
observed n independent tosses of a coin for which the probability of heads
has an unknown value p is required to predict the outcome of the
(n + 1)th toss. Hodges and Lehmann here interpret prediction in the
following somewhat sophisticated, but reasonable, sense. The person
is, in the light of his observation, required to choose a number $\hat p$
between 0 and 1 and to pay a fine of $(1 - \hat p)^2$ or $\hat p^2$ according as the
(n + 1)th toss is in fact heads or tails. Thus the (expected) income
attached to the primary act $\hat p$ and event p is


(1)  $I(\hat p; p) = -p(1 - \hat p)^2 - (1 - p)\hat p^2$
     $= -(\hat p - p)^2 - p(1 - p).$


As Hodges and Lehmann show, the only derived act (mixed or pure)
that yields the minimax of the negative income is to set $\hat p = 1/2$
irrespective of the observation. But it is, in common sense, absurd thus to
ignore the observation of the first n tosses. In view of this absurdity,
almost everyone would agree that applying the minimax rule directly
to the negative of (1) is a foolish act for the person to employ.

The absurdity of minimizing the maximum of negative income in
this example is of course no valid argument against minimizing the
maximum loss. It is easy to see that the loss corresponding to (1) is


(2)  $L(\hat p; p) = (\hat p - p)^2.$


As Hodges and Lehmann happen to show in the same paper [H5]
(though in a different context), and as will be discussed in some detail
in § 4, the unique minimax derived act does use the observations to
advantage, resulting in a loss of


(3)  $\frac{1}{4(1 + n^{1/2})^2}$


irrespective of p. The absurd act of setting $\hat p = 1/2$ irrespective of the
observation results in the loss $(p - 1/2)^2$, which in any ordinary context
would be inferior to (3), especially for large n.
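
The algebra of this section is easy to confirm mechanically. The following Python sketch is illustrative only: the function names income and loss, the grid of test values, and the sample size n = 100 are hypothetical choices. It checks the second form of (1) and the loss (2) exactly on a grid, and then lists the grid values of p at which the constant guess 1/2 has a larger loss than the constant (3).

```python
from fractions import Fraction
from math import isqrt


def income(p_hat, p):
    """Expected income (1): minus the expected fine."""
    return -p * (1 - p_hat) ** 2 - (1 - p) * p_hat ** 2


def loss(p_hat, p):
    """Loss: best attainable income at p minus the income of p_hat."""
    best = income(p, p)          # income is maximized in p_hat at p_hat = p
    return best - income(p_hat, p)


grid = [Fraction(k, 20) for k in range(21)]

# Second form of (1) and equation (2), checked exactly on a grid.
assert all(income(a, p) == -(a - p) ** 2 - p * (1 - p) for a in grid for p in grid)
assert all(loss(a, p) == (a - p) ** 2 for a in grid for p in grid)

n = 100                                               # hypothetical sample size, a perfect square
minimax_loss = Fraction(1, 4 * (1 + isqrt(n)) ** 2)   # expression (3)
half = Fraction(1, 2)
worse = [p for p in grid if (p - half) ** 2 > minimax_loss]
print(minimax_loss, [float(p) for p in worse])
```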

Incidentally, the minimax derived from (2), though not nearly so
bad as setting $\hat p$ identically equal to 1/2, is itself open to a serious
objection, which will be explained in § 4.


3 Utility and the minimax rule 


Some objections to the objectivistic, and mutatis mutandis to the 
group, minimax rule are in effect objections to the concept of utility, 
which underlies the minimax rules. Criticisms of the concept of utility 
have already been discussed in Chapter 5, particularly in § 5.6, but 
certain aspects of the discussion need to be continued here. 

It is often said, and I think with justice, that, even granting the
validity of the utility concept in principle, a person can seldom write
down his income function I(r; i) with much accuracy. This idea is
put forward sometimes with one interpretation and sometimes with
another. Of these, only the first is strictly an objection to the utility
concept.

That one is a dilemma raised by the phenomenon of vagueness. 
Vagueness may so blur a person’s utility judgments that he cannot ac- 
curately write down his income function. I suppose that no one will 
seriously deny this; I would be particularly embarrassed to do so, for 
it is almost a recapitulation of the very argument that leads me, though 
in principle a personalist, to see some sense in the objectivistic decision 
problem. On the other horn, if all meaning is denied to utility (or some 
extension of that notion) no unification of statistics seems possible. 
Three special circumstances are known to me under which escape from
the dilemma is possible. First, there are problems in which some 
straightforward commodity, such as money, lives, man hours, hospital 
bed days, or submarines sighted, is obviously so nearly proportional to 
utility as to be substitutable for it. Second, there are problems in 
which exact or approximate minimax decisions can be calculated on 
the basis of only relatively little, and easily available, information about 
the income function, such as symmetry, monotoneity, or smoothness. 
The possibility of cheap extensive observation, which (when it occurs) 
makes the minimax principle attractive, also tends to make many de- 
cision problems fall into both of the two types in which the difficulty 
of vagueness is alleviated. For example, in a monetary decision prob- 
lem with cheap observation available, it often happens that the weak 
law of large numbers, and the like, can be invoked to justify regarding 
cash income as proportional to utility income. 

Third, there are many important problems, not necessarily lacking 
in richness of structure, in which there are exactly two consequences, 
typified by overall success or failure in a venture. In such a problem, 
as I have heard J. von Neumann stress, the utility can, without loss 
of generality, be set equal to 0 on the less desired and equal to 1 on the 
more desired of the two consequences. 

The second sense in which it may, though not quite properly, be 
said to be impossible to write down the income function is typified by 
this example. A manufacturer of small short-lived objects, say paper 
napkins, is faced with the problem of deciding on a program of sam- 
pling to control the quality of his product. He complains that, though 
for this problem his utility is adequately measured by money, he can- 
not write down his income function because he does not know how the 
public will react to various levels of quality—that, in particular, the 
minimax rule does not tell him at all how much he ought to spend on 
the sampling program, though it may say how any given amount can 
best be employed. The manufacturer has a real difficulty, though he 
expresses it inaccurately. He forgets that the lack of knowledge that 
gives rise to the decision problem involves not only the state of his 
product, but also the state of the public; taking the state of the public 
into account, there is no real difficulty in writing down the income func- 
tion. But, if it is not practical for the manufacturer to make observa- 
tions bearing on the state of the public as well as those bearing on the 
state of the product, the minimax rule is not a practical solution to his 
problem; for, rigorously applied, it would remove him from the
paper-napkin business. I believe that in practice the personalistic method
often is, and must be, used to deal with the unknown state of the
public, while objectivistic methods, particularly the minimax principle, are
now increasingly often used to deal with the state of the product, a
sort of dualism having some parallel in almost all serious applications 
of statistics. This is not to deny that relatively objectivistic methods 
of market research can sometimes be used, nor that there are personal- 
istic elements aside from those concerning the state of the public in 
much of even the most advanced quality control practice. 


4 Almost sub-minimax acts 


Another sort of objection to the objectivistic minimax rule is
illustrated by the following example attributed to Herman Rubin and
published by Hodges and Lehmann [H5]. An integer-valued random
variable x subject to the binomial distribution


(1)  $P(x \mid p) = \binom{n}{x} p^x (1 - p)^{n - x}$


is observed by a person who knows n but not p. His decision problem
is to decide on a function $\hat p$ of x subject to the loss function:


(2)  $L(\hat p; p) = E((\hat p - p)^2 \mid p)$
     $= \sum_x (\hat p(x) - p)^2 \binom{n}{x} p^x (1 - p)^{n - x}.$


In other terms, he must estimate p on the basis of an observation of x
and subject to a loss equal to the square of his error. The traditional
estimate of p is defined by $\hat p_0(x) = x/n$. This estimate has many
virtues; it is the maximum-likelihood estimate, the only unbiased
estimate, and (as is shown in [G1]) the only minimax estimate for a
somewhat different problem from that posed by (2). But for (2) the unique
minimax is (as is shown in [H5]) defined by


(3)  $\hat p_1(x) = \hat p_0(x) + \frac{1/2 - \hat p_0(x)}{1 + n^{1/2}}.$

As it is straightforward to verify for every p,

(4)  $L(\hat p_0; p) = \frac{p(1 - p)}{n}$

and

(5)  $L(\hat p_1; p) = \frac{1}{4(1 + n^{1/2})^2},$


which constant is, therefore, L*. The ratio of the first of these functions
to the second is

(6)  $4p(1 - p)(1 + n^{-1/2})^2,$


the maximum of which occurs at p = 1/2 and is


(7)  $(1 + n^{-1/2})^2.$


Thus, for large n, the maximum loss of $\hat p_0$ is larger than L* by only a
slight fraction. Moreover, the loss of $\hat p_0$ is less than L* except when p
lies in the interval where


(8)  $4p(1 - p) > (1 + n^{-1/2})^{-2},$
that is, where
(9)  $|p - 1/2| < \tfrac{1}{2} \{1 - (1 + n^{-1/2})^{-2}\}^{1/2} \simeq (4n)^{-1/4}.$


To take a numerical example, consider $n = 10^5$ (which the practical
will note is rather big for a sample). The advantage of $\hat p_1$ over $\hat p_0$ at
p = 1/2 is then only 0.64%, and, once p departs by as much as 0.04
from 1/2 in either direction, the advantage is with $\hat p_0$. It amounts,
for example, to 3.5%, 55.3%, ∞% in favor of $\hat p_0$, when p is 0.6, 0.8,
1.0, respectively.
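
The two loss formulas and the numerical example can be checked by direct computation. The Python sketch below is a hedged illustration: the function names risk, p0, p1, and advantage_of_p0 are hypothetical, the exact check uses a small perfect-square sample size n = 16 chosen only for convenience, and the numerical part assumes the sample size $10^5$, the value consistent with the half-width 0.04 quoted above. "Advantage in favor of $\hat p_0$" is read here as the ratio of the loss of $\hat p_1$ to that of $\hat p_0$, less one, which is an assumption about the intended measure.

```python
from fractions import Fraction
from math import comb, isqrt


def risk(estimate, p, n):
    """Expected squared error (2) of an estimate, summed over x = 0..n."""
    return sum((estimate(x) - p) ** 2 * comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(n + 1))


n = 16                      # hypothetical small sample; a perfect square
root = isqrt(n)


def p0(x):
    """Traditional estimate x / n."""
    return Fraction(x, n)


def p1(x):
    """Minimax estimate (3)."""
    return p0(x) + (Fraction(1, 2) - p0(x)) / (1 + root)


for p in (Fraction(1, 10), Fraction(1, 2), Fraction(4, 5)):
    assert risk(p0, p, n) == p * (1 - p) / n                    # equation (4)
    assert risk(p1, p, n) == Fraction(1, 4 * (1 + root) ** 2)   # equation (5)


def advantage_of_p0(p, n):
    """L(p1; p) / L(p0; p) - 1, one reading of 'advantage in favor of p0'."""
    return 1 / (4 * p * (1 - p) * (1 + n ** -0.5) ** 2) - 1


big_n = 10 ** 5             # assumed reading of the sample size in the text
for p in (0.5, 0.54, 0.6, 0.8):
    print(p, round(100 * advantage_of_p0(p, big_n), 2))
```

A negative printed value means that the advantage lies with $\hat p_1$ at that p.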

Many agree that in such an example good judgment will, under
ordinary circumstances, prefer $\hat p_0$ to the recommendation of the minimax
rule, $\hat p_1$. To my mind, this example constitutes a valid objection against
the minimax rule, in the sense that it demonstrates once more that,
whatever value that rule may have, it is at best a rule of thumb.

The example is a good illustration of the role of personal probability
in ordinary statistical thinking, for the source of the dissatisfaction a
person would ordinarily feel for $\hat p_1$ as opposed to $\hat p_0$ stems from the fact
that he would not ordinarily attach enough personal probability to the
immediate neighborhood of p = 1/2 to justify preference for $\hat p_1$. It
follows from the numbers given above, for example, that, if the person
attaches a probability of less than 0.84 to the interval [0.4, 0.6], he will
prefer $\hat p_0$ to $\hat p_1$; the same conclusion can be derived from the supposition
that the standard deviation of the personal distribution of p is at least
0.04. Of course, situations can be imagined in which the personal
probabilities would be so concentrated about 1/2 as to justify preference for
$\hat p_1$; the point of the example is only that there are situations in which
that would clearly not be the case.
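
The figure 0.84 can be recovered by a short calculation. The sketch below is only one way of arriving at it, under assumptions: the sample size is taken to be $10^5$, losses are measured in units of L*, and the names loss_ratio, worst_excess, and least_saving are hypothetical. Inside [0.4, 0.6] the loss of the traditional estimate exceeds L* by at most its excess at p = 1/2; outside that interval it falls short of L* by at least its shortfall at p = 0.6.

```python
def loss_ratio(p, n):
    """L(p0; p) / L*, expression (6)."""
    return 4 * p * (1 - p) * (1 + n ** -0.5) ** 2


n = 10 ** 5                                  # assumed sample size
worst_excess = loss_ratio(0.5, n) - 1        # most that p0 can lose, in units of L*
least_saving = 1 - loss_ratio(0.6, n)        # least that p0 gains outside [0.4, 0.6]

# p0 has the smaller expected loss whenever q * worst_excess is less than
# (1 - q) * least_saving, where q is the probability of [0.4, 0.6].
threshold = least_saving / (least_saving + worst_excess)
print(round(threshold, 3))
```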

Interesting material and important references bearing on the
phenomenon illustrated by the decision problem under discussion are given
by Wolfowitz in [W17]. It seems to be suggested there that the
difficulty can be met by postulating some small amount ε by which the
person does not mind having his income decreased. Taken literally,
this postulate implies on repeated application that all incomes are
equivalent for the person, but Wolfowitz makes it clear that he does
not mean to propose the postulate in a sense that allows repeated
applications. The idea is reminiscent of those theories of probability
that permit the neglect of an occasional improbable event (mentioned
in the last paragraph of § 4.4) and seems to me open to an objection
similar to the one raised in connection with them. In particular, the
choice of the ε would be not only personal, but ill defined as well.


5 The minimax rule does not generate a simple ordering 


Finally, an objection made by Chernoff [C7] to the objectivistic mini- 
max theory must be discussed. This will entail statement and illus- 
tration of the phenomenon on which the objection is based, and state- 
ment and analysis of the objection itself. 

The phenomenon pertains to the relation between two objectivistic
decision problems, to be called for the moment the narrow and the
wide problems. The narrow problem is determined by certain primary
acts $f_r$; and the wide one is determined by those primary acts and one
more, say $f_0$. In other words, the wide problem presents the person
with one more choice than the narrow. Calling the two income
functions I(f; i) and $I_0(f; i)$, it is to be understood, of course, that
I(f; i) = $I_0(f; i)$ for any f that does not use, that is, give positive weight
to, $f_0$. The corresponding equation does not necessarily obtain for the
loss functions; indeed it clearly does so, if and only if the maximum of
$I_0(f; i)$ in f can be attained for each i without using $f_0$. Even in case
no minimax of the wide game uses $f_0$, it is therefore to be expected that
the minimax f's of the wide game will be different from those of the
narrow game. In fact, it can happen that no minimax of the wide game
uses either $f_0$ or any $f_r$ used by a minimax of the narrow game; this is
the phenomenon to be discussed in this section.

To see how the phenomenon can occur, suppose that Figure 12.4.1
represents the loss function of the narrow problem; and consider what
the corresponding figure is for the wide problem, supposing that $f_0$ is
such that


(1)  $\Delta =_{Df} I(f_0; 2) - \max_r I(f_r; 2) > 0,$
     $\Sigma =_{Df} \max_r I(f_r; 1) - I(f_0; 1) > 0.$

It is clear that Δ and Σ can attain any positive values, irrespective of
the structure of the narrow problem. The figure for the wide problem
is constructed thus: The graph corresponding to each $f_r$ is left fixed at
its right end and raised by the amount Δ at its left, and $f_0$ is represented
by a line sloping up with slope Σ from the lower left-hand corner. It is
easy to see that the raising of the left ends of the graphs of the $f_r$'s can
make any $f_r$ with a positive slope horizontal. If, further, such an $f_r$
minimizes L(f; g) for some g, it can be made a minimax by choosing Σ
sufficiently large. Thus, speaking specifically of Figure 12.4.1, the $f_r$
corresponding to the left segment of the heavy concave graph, which is
not used in the minimax of the narrow problem, can become the unique
minimax. Figure 12.4.1 is a little special in that the heavy concave
graph has only one vertex to the left of the maximin of the narrow
problem. If there were more than one, the phenomenon could also be
exhibited by making the second vertex to the left the unique maximin,
which would occur for all Δ's and Σ's in a certain range. Thus the
phenomenon occurs not only for isolated values of Δ and Σ but typically
for whole domains of values.

Suppose, to take a striking case, that one $f_r$, say $f_{r'}$, is the unique
minimax for the narrow problem and a different one, $f_{r''}$, is the unique
minimax for the wide problem. It is absurd, as Chernoff says in effect,
to recommend $f_{r'}$ as the best act among the $f_r$'s when only the $f_r$'s are
available and then to recommend $f_{r''}$ as the best for an even wider
class of possibilities. Fancy saying to the butcher, “Seeing that you
have geese, I’ll take a duck instead of a chicken or a ham.”
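
A concrete instance of the phenomenon of this section may help. The Python sketch below is a hedged illustration with invented numbers: the incomes in narrow and wide, the act names a, b, c, f0, and the helper functions losses and minimax are all hypothetical. The search relies on the standard fact that, with only two states, some minimax mixture uses at most two primary acts; it returns one minimax mixture, not a proof of uniqueness. In this example the narrow minimax mixes b and c, while the minimax found for the wide problem is the pure act a, which uses neither the new act nor any act used before.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def losses(income):
    """Loss L(r; i) = max over r' of I(r'; i) - I(r; i), for two states i."""
    best = [max(row[i] for row in income.values()) for i in range(2)]
    return {r: tuple(Fraction(best[i] - row[i]) for i in range(2))
            for r, row in income.items()}


def minimax(loss):
    """Return (value, support) of one minimax mixture of the primary acts.

    Assumes two states, so that some minimax mixes at most two acts.
    """
    best_value, best_support = None, None
    for r, s in combinations_with_replacement(loss, 2):
        (a1, a2), (b1, b2) = loss[r], loss[s]
        weights = [Fraction(0), Fraction(1)]      # weight w placed on act r
        slope = (a1 - b1) - (a2 - b2)
        if slope != 0:
            w = (b2 - b1) / slope                 # equalizes the two losses
            if 0 < w < 1:
                weights.append(w)
        for w in weights:
            value = max(w * a1 + (1 - w) * b1, w * a2 + (1 - w) * b2)
            if best_value is None or value < best_value:
                support = {r} if w == 1 else {s} if w == 0 else {r, s}
                best_value, best_support = value, support
    return best_value, best_support


# Hypothetical incomes (I(r; 1), I(r; 2)); f0 is the extra act of the wide problem.
narrow = {"a": (0, 10), "b": (6, 7), "c": (10, 0)}
wide = dict(narrow, f0=(-30, 20))                 # f0 gains 10 at i = 2 and gives up 40 at i = 1

print(minimax(losses(narrow)))
print(minimax(losses(wide)))
```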

It is absurd, then, to contend that the objectivistic minimax rule
selects the best available act. But that is not so devastating to the rule
as might at first appear, for it is not contended by anyone known to
me that the rule does select the best. On the contrary, the rule is
invoked only as a sometimes practical rule of thumb in contexts where
the concept of “best” is impractical: impractical for the objectivist,
where it amounts to the concept of personal probability, in which he
does not believe at all; and for the personalist, where the difficulty of
vagueness becomes overwhelming. To have a consistent concept of
“best,” that is, to have a mode of decision that does not exhibit the
phenomenon, amounts, as Chernoff himself points out, to the
establishment of a simple ordering of preference among acts. In so far as that
can be done consistently with the sure-thing principle, personal
probability is practically defined thereby. If the sure-thing principle is
violated, the ordering is absurd as an expression of preference. For
example, the rule of minimizing the maximum of the negative of income
does not exhibit the phenomenon. It amounts to considering f ≤ f', if
and only if


(2)  $\min_i I(f; i) \le \min_i I(f'; i).$


This establishes a simple ordering, but one that violates the sure-thing 
principle by violating P2. 

The phenomenon has a particularly natural interpretation for the 
group minimax rule. It would not be strange, for example, if a 
banquet committee about to agree to buy chicken should, on being in- 
formed that goose is also available, finally compromise on duck. 


CHAPTER 14 


The Minimax Theory 
Applied to Observations 


1 Introduction 


In this chapter the concept of observation is re-explored from the 
point of view of the minimax rule. In principle, objectivistic and group 
minimax problems should here be treated on an equal footing. But, 
since mathematically the two theories are identical, it seems wisest to 
focus on one, interjecting occasional digressions about the other. I 
have chosen to focus on the objectivistic problems. That choice, being 
in accordance with other literature on the minimax rule, will facilitate 
the reader’s further study of the subject, and it also renders more ob- 
vious the intimate connection between the minimax rules and the theory 
of partition problems presented in Chapter 7. The present chapter 
can indeed be regarded largely as a paraphrase of Chapter 7, so there 
will unavoidably be many references to the notations and conclusions 
of that chapter. 


2 Recapitulation of partition problems 


Paralleling the treatment of observation in Chapters 6 and 7, an 
objectivistic observational problem will be roughly defined to consist of 
an objectivistic problem, regarded as basic; an observation; and a sec- 
ond objectivistic problem, derived from the basic one and the obser- 
vation. 

More explicitly, the basic problem may be any objectivistic problem.
It will be characterized by the values of $E(f \mid B_i)$, where f ranges over
a set of acts F subject to the conditions laid down in § 9.3, and $B_i$ is a
partition.

The observation is a random variable x (confined, as usual in this
book, to a finite set of values), subject to the conditional distributions
$P(x \mid B_i)$, and so articulated with F that $E(f \mid B_i, x) = E(f \mid B_i)$ for
every x such that $P(x \mid B_i) > 0$. The last condition is (7.2.7); as
mentioned in connection with that equation, the condition will in
particular be met, if every f is constant on every $B_i$, a specialization costing
but little in real generality.

The derived problem (paralleling § 6.2) consists of F(x), the set of all
functions assigning elements f of the basic acts F to values x of the
observation x. The values of $E(f(\mathbf{x}) \mid B_i)$ for $f(\mathbf{x}) \in F(\mathbf{x})$ are computable
from the $E(f \mid B_i)$ and the $P(x \mid B_i)$ thus:


(1)  $E(f(\mathbf{x}) \mid B_i) = E(E(f(\mathbf{x}) \mid B_i, \mathbf{x}) \mid B_i)$
     $= \sum_x E(f(x) \mid B_i, x) P(x \mid B_i)$
     $= \sum_x E(f(x) \mid B_i) P(x \mid B_i).$
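
The last line of (1) is a plain weighted sum and is easily computed. The Python sketch below is illustrative only: the tables EXPECTED and LIKELIHOOD, the act names a and b, and the function derived_expectation are hypothetical, with two elements in the partition and two possible values of the observation.

```python
from fractions import Fraction

# Hypothetical basic problem: EXPECTED[f][i] stands for E(f | B_i), i = 0, 1.
EXPECTED = {"a": (Fraction(5), Fraction(0)),
            "b": (Fraction(1), Fraction(3))}

# Hypothetical observation: LIKELIHOOD[i][x] stands for P(x | B_i), x = 0, 1.
LIKELIHOOD = [(Fraction(4, 5), Fraction(1, 5)),
              (Fraction(3, 10), Fraction(7, 10))]


def derived_expectation(rule, i):
    """E(f(x) | B_i) by the last line of (1); rule[x] is the act chosen at x."""
    return sum(EXPECTED[rule[x]][i] * LIKELIHOOD[i][x]
               for x in range(len(rule)))


rule = ("a", "b")        # choose a after x = 0 and b after x = 1
print([derived_expectation(rule, i) for i in range(2)])
```

A rule that ignores the observation, such as ("a", "a"), gives back the basic values E(a | B_i), as it should.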


It will now be shown that the set of derived acts F(x) satisfies the
technical conditions imposed on the set of basic acts F, so that the
derived problem is also an objectivistic decision problem. In fact, if
every $f \in F$ is expressible in the form $\sum f(r) f_r$ (with the usual condition
on the f(r)'s), primary acts for F(x) analogous to the $f_r$'s can be defined
by attaching to every function $\mathbf{r} = \mathbf{r}(x)$ an element $f(\mathbf{x}; \mathbf{r})$ of F(x),
where


(2)  $f(\mathbf{x}; \mathbf{r}) =_{Df} f_{\mathbf{r}(\mathbf{x})}.$


There are only a finite number of $f(\mathbf{x}; \mathbf{r})$'s, and all elements of F(x) are
expressible as weighted averages of them; the first assertion is obvious,
and the second poses the problem of finding, for any system of
probability measures φ(r; x) on the r's, at least one probability measure on
the set of functions $\mathbf{r}$ with respect to which $P(\mathbf{r}(x) = r) = \phi(r; x)$ for
every r and x. The problem typically has many solutions; the simplest
is to let the $\mathbf{r}(x)$'s, regarded for each x as functions of $\mathbf{r}$, be independent
random variables on the set of $\mathbf{r}$'s considered as a probability space,
that is, to set


$P(\mathbf{r}) = \prod_x \phi(\mathbf{r}(x); x).$


Formally, this particular solution leads to the identity


(3)  $f(x) = \sum_r \phi(r; x) f_r$
     $= \sum_{\mathbf{r}} \{ \prod_{x'} \phi(\mathbf{r}(x'); x') \} f_{\mathbf{r}(x)}.$


The identity and the fact that the coefficients in braces are non-negative
and add up to 1, are easy to check analytically, if it is recognized
that summation with respect to $\mathbf{r}$ means multiple summation with
respect to $\mathbf{r}(1)$, $\mathbf{r}(2)$, ··· (the x's being for definiteness supposed to take
integral values). Equation (3) shows incidentally that it is immaterial
whether it is before or after the observation that mixed acts are
introduced.
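
The product construction can be verified by enumeration in a small case. In the Python sketch below, which is illustrative only, the table PHI, the three primary acts, and the helper weight are hypothetical. Every function from the two values of x to the acts is listed, given the product weight, and the weights are then checked to add up to 1 and to reproduce each φ(r; x) as a marginal.

```python
from fractions import Fraction
from itertools import product
from math import prod

# Hypothetical system of measures: PHI[x][r] stands for phi(r; x),
# with two values of x and three primary acts r = 0, 1, 2.
PHI = [(Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)),
       (Fraction(1, 4), Fraction(0), Fraction(3, 4))]
ACTS = range(3)

# A function r is stored as a tuple whose entry x is r(x).
functions = list(product(ACTS, repeat=len(PHI)))


def weight(func):
    """P(r) as the product over x of phi(r(x); x)."""
    return prod(PHI[x][func[x]] for x in range(len(PHI)))


total = sum(weight(func) for func in functions)
marginals = [[sum(weight(func) for func in functions if func[x] == r)
              for r in ACTS] for x in range(len(PHI))]
print(total, marginals == [list(row) for row in PHI])
```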

Turn momentarily to the idea of observation in group decision
problems. Here the $E(f \mid B_i)$'s are replaced by I(f; i)'s, the expected income
of f in the opinion of the ith person. There is no partition $B_i$, except
in a special, though theoretically important, case, namely that of the
ith person holding unequivocally that $B_i$ obtains.

The $P(x \mid B_i)$'s are here replaced by P(x; i)'s, the personal
distribution of x for the ith person. It is postulated that, for each person, the
conditional expectation of f is unaffected by knowledge of x.

The derived acts are formally the same as for an objectivistic decision
problem, and the income function of the derived group decision
problem is


(4)  $I(f(\mathbf{x}); i) = \sum_x I(f(x); i) P(x; i).$


Returning to objectivistic problems, (9.4.1) defines the loss function
of the basic objectivistic problem and, mutatis mutandis, that of the
derived problem also, thus:


(5)  $L(f(\mathbf{x}); i) = \max_{f'(\mathbf{x})} E(f'(\mathbf{x}) \mid B_i) - E(f(\mathbf{x}) \mid B_i).$


The right side of (5) admits some simplification, for, if the person knew
which $B_i$ obtained, observation would be valueless to him.
Accordingly,

(6)  $L(f(\mathbf{x}); i) = \max_{f'} E(f' \mid B_i) - E(f(\mathbf{x}) \mid B_i).$


Analytically, the simplification is justified thus:

(7)  $\max_f E(f \mid B_i) \le \max_{f(\mathbf{x})} E(f(\mathbf{x}) \mid B_i)$
     $= \max_{f(\mathbf{x})} \sum_x E(f(x) \mid B_i) P(x \mid B_i)$
     $\le \max_f E(f \mid B_i).$


In discussing application of the minimax rule to the basic and
derived loss functions, it is doubly advantageous to introduce mixtures
of the i's, for thereby the theory of bilinear games presented in Chapter
12 and that of partition problems (with some reinterpretation) can
both be brought to bear. Letting β denote a generic system of weights
β(i), $\beta(i) \ge 0$ and $\sum \beta(i) = 1$, and using the notation of Chapter 7, the
bilinear games associated with the primary and derived problems are,
respectively,


(8)  $L(f; \beta) = l(\beta) - E(f \mid \beta),$
(9)  $L(f(\mathbf{x}); \beta) = l(\beta) - E(f(\mathbf{x}) \mid \beta)$
     $= l(\beta) - \sum_x \sum_i E(f(x) \mid B_i) P(x \mid B_i) \beta(i)$
     $= l(\beta) - \sum_x E(f(x) \mid \beta, x) P(x \mid \beta).$


If necessary, (9) can be interpreted and verified by comparison with
(7.3.7) and (7.2.8), in that order.

In Chapter 7, β(i) was generally required not only to be non-negative,
but also strictly positive; on examination, this slight difference from
the present context will be found innocuous. Again, in Chapter 7, the
statement and derivation of conclusions were, for simplicity, nominally
confined to twofold partition problems. Here the extension of those
conclusions to n-fold problems will be freely used, though some readers
may prefer here, as there, to focus on twofold problems.

Letting L* denote the minimax (and maximin) value of the basic,
and L*(x) that of the derived problem, it is obvious, since $F(\mathbf{x}) \supset F$,
that $L^*(\mathbf{x}) \le L^*$; but there is some interest in viewing this inequality
as a consequence of (7.3.4):


(10)  $L^*(\mathbf{x}) = \max_\beta \min_{f(\mathbf{x})} L(f(\mathbf{x}); \beta)$
      $= \max_\beta [l(\beta) - v(F(\mathbf{x}) \mid \beta)]$
      $\le \max_\beta [l(\beta) - v(F \mid \beta)]$
      $= \max_\beta \min_f L(f; \beta) = L^*.$


It is clear that the maximin β's for the basic and derived problems are
the β's that maximize the concave functions


(11)  $h(\beta) =_{Df} l(\beta) - v(F \mid \beta) = l(\beta) - k(\beta)$


and


(12)  $h(\beta; \mathbf{x}) =_{Df} l(\beta) - v(F(\mathbf{x}) \mid \beta) = l(\beta) - E(k(\beta(\mathbf{x})) \mid \beta),$


respectively. The search for minimax f(x)'s, for example, is greatly
narrowed by the consideration that, if f(x) is minimax, $E(f(\mathbf{x}) \mid \beta) =
v(F(\mathbf{x}) \mid \beta)$ for some β, indeed for every maximin β. According to § 7.3,
equality obtains in (10), if and only if there is a maximin $\beta_0$ of the
basic problem such that


(13)  $\beta_0(i \mid x) = \frac{P(x \mid B_i) \beta_0(i)}{\sum_j P(x \mid B_j) \beta_0(j)}$


is also a maximin of the basic problem for every x such that
$\sum_j P(x \mid B_j) \beta_0(j) > 0$.


The most typical possibility, and the only one to be explored here, is
that the basic problem has a unique maximin $\beta_0$ with $\beta_0(j) > 0$ for all
j. Under this assumption, L*(x) = L*, if and only if x is utterly
irrelevant, as is easily shown.

In the same spirit, as can easily be shown, L*(x) = 0, if x is
definitive, but not typically otherwise; and, if x extends y, then $L^*(\mathbf{x}) \le
L^*(\mathbf{y})$ with equality if, and typically only if, y is sufficient for x.


3 Sufficient statistics 


Digressing from the minimax rule for a moment, something more 
fundamental can be said about a sufficient statistic $y$ of $x$. Namely, 
for every $f(x) \in F(x)$, there exists an $f(y) \in F(y)$ such that 
$I(f(y); i) = I(f(x); i)$ for every $i$. Indeed 
$f(y) = \sum_x f(x) P(x \mid y)$ defines such an act. Without appeal to so 
weak a step as the minimax rule, this remark demonstrates that even an 
objectivist loses nothing by exchanging knowledge of an observation for 
knowledge of a sufficient statistic of it. The remark might as well have 
been expressed in § 7.4, except that there it would have involved some 
circumlocution, mixed acts not yet having been introduced. 
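
The remark about sufficient statistics can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions that are not in the text: two states, an observation x made of two Bernoulli trials, the statistic y = x1 + x2 (sufficient here), two basic acts with made-up incomes, and an arbitrary mixed derived act `f_x`. It forms f(y) as the average of f(x) weighted by P(x | y) and compares the two incomes for each i.

```python
from fractions import Fraction as Fr
from itertools import product

# Hypothetical setup (not from the text): success probability under B_1, B_2.
p = {1: Fr(1, 3), 2: Fr(3, 4)}
xs = list(product((0, 1), repeat=2))            # values of x = (x1, x2)
income = {("r1", 1): 5, ("r1", 2): 0,           # made-up incomes I(r; i)
          ("r2", 1): 1, ("r2", 2): 4}

def P_x(x, i):
    """P(x | B_i) for two independent trials."""
    return (p[i] if x[0] else 1 - p[i]) * (p[i] if x[1] else 1 - p[i])

def y_of(x):
    return x[0] + x[1]

def P_x_given_y(x, y):
    """P(x | y); it does not involve i, which is what sufficiency means."""
    if y_of(x) != y:
        return Fr(0)
    return Fr(1, sum(1 for z in xs if y_of(z) == y))

# An arbitrary mixed derived act: probability of choosing r1 after seeing x.
f_x = {(0, 0): Fr(1), (0, 1): Fr(1, 5), (1, 0): Fr(9, 10), (1, 1): Fr(0)}

# f(y) = sum over x of f(x) P(x | y)
f_y = {y: sum(f_x[x] * P_x_given_y(x, y) for x in xs) for y in (0, 1, 2)}

def act_income(prob_r1, i):
    return prob_r1 * income[("r1", i)] + (1 - prob_r1) * income[("r2", i)]

def income_of_x_act(i):
    return sum(P_x(x, i) * act_income(f_x[x], i) for x in xs)

def income_of_y_act(i):
    return sum(P_x(x, i) * act_income(f_y[y_of(x)], i) for x in xs)
```

Exact rational arithmetic is used so that the two incomes can be compared with `==` rather than within a tolerance.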


4 Simple dichotomy, an example 

Much of what has been said thus far is well illustrated by the mini- 
max counterpart of Exercise 7.5.2. The reader is accordingly asked to 
review that exercise and continue it thus: 
Exercises 

1. For the problem in question: 

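(a) $h(\beta) = \delta_2\beta(1) + \delta_1\beta(2) - |\,\delta_1\beta(2) - \delta_2\beta(1)\,|$. 

(b) $h(\beta; x) = \delta_2\beta(1) + \delta_1\beta(2) - \sum_r |\,\delta_1 r_2 \beta(2) - \delta_2 r_1 \beta(1)\,| \{\sum_j P(r \mid B_j)\}$ 


    $= \delta_2 [2P(r_1 < r_1^*(\beta, \beta_0) \mid B_1) + P(r = r^*(\beta, \beta_0) \mid B_1)]\beta(1)$ 
    $\quad + \delta_1 [2P(r_2 < r_2^*(\beta, \beta_0) \mid B_2) + P(r = r^*(\beta, \beta_0) \mid B_2)]\beta(2)$. 

2a. A $\beta$ is maximin, if and only if $r^*(\beta, \beta_0)$ is such that 


(1)   $\delta_2 P(r_1 < r_1^*(\beta, \beta_0) \mid B_1) \le \delta_1 P(r_2 \le r_2^*(\beta, \beta_0) \mid B_2)$ 
and 
(2)   $\delta_2 P(r_1 \le r_1^*(\beta, \beta_0) \mid B_1) \ge \delta_1 P(r_2 < r_2^*(\beta, \beta_0) \mid B_2).$ 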


2b. There is typically only one maximin, but there may be a closed 
interval of them. 


3. Though the acts of F and F(x) as defined by Exercise 7.5.2 do not 
provide for mixed acts, it will suffice to consider mixtures of the f(x)’s. 
Each of these will be determined by an i, and nothing will be lost by 
requiring $i$ to be of the form $i(r(x))$. 

4a. Any minimax will be equivalent to a mixture of $f(x)$'s each 
corresponding to a likelihood-ratio test associated with 
$r^*(\beta, \beta_0)$ for every maximin $\beta$. 

4b. In view of Exercise 3, the only likelihood-ratio tests that need 
be considered for a minimax $\beta$ are: 


$i(r) = 1$, if and only if $r_1 < r_1^*(\beta, \beta_0)$. 
$i(r) = 1$, if and only if $r_1 \le r_1^*(\beta, \beta_0)$. 


These are not necessarily different tests. 

5a. If the maximin $\beta$ is unique, the minimax act is unique (except 
possibly for equivalent acts) and is a mixture of exactly two $f(x)$'s 
corresponding to the two likelihood-ratio tests defined in Exercise 4b. 

This conclusion calls for some comment, for, in ordinary statistical 
practice, one or the other of the extreme likelihood-ratio tests is used, 
never a mixture. This practice is not in serious conflict with the 
minimax rule, because the maximum loss associated with either extreme is 
typically only slightly greater than $L^*(x)$. Moreover, vagueness about 
the exact magnitude of $\delta_1$ and $\delta_2$ would usually frustrate 
any attempt to calculate the coefficients of the mixture. Incidentally, 
mixture is not called for at all when $r$ is continuously distributed, for 
$h(\beta; x)$ is then smooth rather than polygonal; that is, if 
$P(r = r' \mid B_i) = 0$ for every $r'$ and both $i$'s, then $h(\beta; x)$ 
has a continuous first derivative in $\beta$. To show this and to show 
that the derivative is 
$\delta_2 P(r_1 < r_1^* \mid B_1) - \delta_1 P(r_2 < r_2^* \mid B_2)$ may 
be taken as an exercise only slightly beyond the usual mathematical level 
of this book. 

5b. If there is more than one maximin $\beta$, then any one that is not 
extreme has only one likelihood-ratio test associated with it, and the 
same one for all. The $f(x)$ corresponding to that test is essentially the 
only minimax. 


5 The approach to certainty + 


In concluding the paraphrase of § 7.1-6 that has thus far been the 
subject of the present chapter, it should be mentioned that the approach 
to certainty studied in § 7.6 obviously implies that the corresponding 
L*(x(n)) approaches zero with increasing n. 


6 Cost of observation 


A cost $c$ associated with an objectivistic observational problem 
diminishes the income by $E(c \mid B_i)$ for each $i$, regardless of $f$; 
that is, allowing for the cost, $I(f; i) = E(f - c \mid B_i)$. But the 
cost, being unavoidable, does not affect the loss function, so the 
minimax problem associated with the observation is independent of the 
cost. The costs do intervene, however, in an essential way in the problem 
of deciding which to choose of several available observations, say $x_a$ 
at cost $c_a$; it is important to bear in mind in connection with this 
problem that a null observation at zero cost is typically among the 
choices available in real life. The generic act of this compound problem 
can conveniently be symbolized by $\sum \lambda(a) f(x_a)$, or sometimes 
simply by $\lambda$. Here, of course, $\lambda(a) \ge 0$, 
$\sum \lambda(a) = 1$; for choice of $\lambda$ means choice, for each $a$, 
of the probability $\lambda(a)$ that the $a$th observation $x_a$ will be 
made and also choice of the derived act $f(x_a)$ to be adopted in case 
$x_a$ is made. It is intuitively evident, and follows easily from (1) 
below, that the mixture of several $\lambda$'s is also a $\lambda$ as far 
as income is concerned, so mixtures of $\lambda$'s do not require explicit 
consideration. The income function can be written 


(1)   $I(\lambda; i) = \sum_a \lambda(a)\, E(f(x_a) - c_a \mid B_i).$ 

Whence 

(2)   $\max_{\lambda} I(\lambda; i) = \max_{f} E(f \mid B_i) - \min_{a} E(c_a \mid B_i).$ 

The loss function is accordingly 

(3)   $L(\lambda; \beta) = \sum_a \lambda(a) \{L_a(f(x_a); \beta) + d_a(\beta)\},$ 

where 

(4)   $d_a(\beta) =_{\mathrm{Df}} \sum_i \{E(c_a \mid B_i) - \min_{a'} E(c_{a'} \mid B_i)\}\, \beta(i),$ 


and $L_a(f(x_a); \beta)$ is the loss function of the observational problem 
derived from the $a$th observation. 

The compound minimax problem is intimately related to the concave 
functions $h(\beta; x_a)$ and the linear functions $d_a(\beta)$, as is 
explained by the following exercises. 


+ Some recent references appropriate to this title are Blackwell and Dubins 
(1962), Chao (1970), Fabius (1964), and Freedman (1965). 


Exercises 

1. Show that 

(5)   $h_\lambda(\beta) =_{\mathrm{Df}} \min_{\lambda} L(\lambda; \beta) = \min_{a}\, [h(\beta; x_a) + d_a(\beta)].$ 

2. If $\lambda = 1 \cdot f'(x_{a'})$, then $L(\lambda; \beta) = h_\lambda(\beta)$, 
if and only if: first, 

(6)   $L_{a'}(f'(x_{a'}); \beta) = h(\beta; x_{a'})$ 

(in which case $f'(x_{a'})$ will be called well adapted to $x_{a'}$ and 
$\beta$); and, second, 

(7)   $h(\beta; x_{a'}) + d_{a'}(\beta) = \min_{a}\, [h(\beta; x_a) + d_a(\beta)]$ 

(in which case $x_{a'}$ will be called well adapted to $\beta$). 

3a. Show that 

(8)   $L_\lambda^* =_{\mathrm{Df}} \min_{\lambda} \max_{\beta} L(\lambda; \beta) = \max_{\beta} h_\lambda(\beta)$ 
      $\le \min_{a} \max_{\beta}\, [h(\beta; x_a) + d_a(\beta)].$ 

3b. Under the important special condition that the $d_a(\beta)$ are equal 
to constants $d_a$, (8) specializes to 

(9)   $L_\lambda^* \le \min_{a}\, [L^*(x_a) + d_a].$ 

3c. When can equality hold in (8) and (9)? 

3d. $\beta'$ is maximin, if and only if $h_\lambda(\beta') = L_\lambda^*$. 

4. $\lambda = \sum \lambda(a) f(x_a)$ is minimax, if and only if: 

($\alpha$) For every $a$ for which $\lambda(a) > 0$, $x_a$ is well adapted 
to every maximin $\beta$, and $f(x_a)$ is well adapted to $x_a$ and every 
maximin $\beta$. 

($\beta$) $L(\lambda; i) \le L_\lambda^*$ for every $i$. (Of course 
($\beta$) is alone necessary and sufficient; the point of the exercise is 
that the necessary condition ($\alpha$) may conveniently confine the 
search for minimax $\lambda$'s to relatively few candidates.) 

5. Suppose that: ($\alpha$) $r$ and $i$ are confined to the values 1 and 2, 
and $L(f_r; i) = |r - i|$; ($\beta$) $x$ is confined to the values 1 and 2, 
and $P(1 \mid B_1) = 1/2$, $P(1 \mid B_2) = 1/4$; ($\gamma$) $a$ is confined 
to the values 1 and 2, and the $\lambda$'s of the compound problem attach 
weight $\lambda(1)$ to a basic act at zero cost and $\lambda(2)$ to an act 
derived from $x$ at a non-negative constant cost $d$. Compute and graph: 
$h(\beta)$, $h(\beta; x)$, and (for various values of $d$) 
$h_\lambda(\beta)$. Graph $L_\lambda^*$ as a function of $d$, and discuss 
the minimax $\lambda$'s for various values of $d$. 
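
As a starting point for the computations asked for in Exercise 5, the following sketch evaluates the three concave functions exactly. It rests on a reading of the exercise that should be checked against the printed text: the loss of basic act r in state i is |r - i|, and P(x = 1 | B_1) = 1/2, P(x = 1 | B_2) = 1/4. The function names and the grid of values of b = beta(1) are illustrative choices, and the grid maximum is exact only when the kink of the function falls on a grid point, as it does for the values of d tried below.

```python
from fractions import Fraction as Fr

# Assumed reading of the exercise: P(x = 1 | B_1) and P(x = 1 | B_2).
P_ONE = {1: Fr(1, 2), 2: Fr(1, 4)}

def h_basic(b):
    """h(beta) with beta(1) = b: least expected loss over the two basic acts."""
    return min(b, 1 - b)

def h_derived(b):
    """h(beta; x): for each value of x, take the better act, then sum over x."""
    total = Fr(0)
    for x in (1, 2):
        px1 = P_ONE[1] if x == 1 else 1 - P_ONE[1]   # P(x | B_1)
        px2 = P_ONE[2] if x == 1 else 1 - P_ONE[2]   # P(x | B_2)
        # act 2 loses under B_1, act 1 loses under B_2
        total += min(b * px1, (1 - b) * px2)
    return total

def h_compound(b, d):
    """h_lambda(beta): the cheaper of no observation and observing at cost d."""
    return min(h_basic(b), h_derived(b) + d)

GRID = [Fr(k, 240) for k in range(241)]   # illustrative grid on [0, 1]

def L_star_compound(d):
    """Grid approximation to the maximum over beta of h_lambda(beta)."""
    return max(h_compound(b, d) for b in GRID)
```

For instance, `L_star_compound(Fr(0))` gives the value for a free observation, and increasing `d` shows where the observation stops being worth its cost.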


7 Sequential probability ratio procedures 


The type of decision problem that in § 7.7 led to the concept of a 
sequential probability ratio procedure has an intimate counterpart in 
an important type of compound objectivistic decision problem, for 
which the concept was in fact originally developed by Wald [W2]. 
The $x_a$'s of a problem of this type range over the enormous variety of 
sequential observational programs associated with a sequence of 
(conditionally) identically distributed random variables 
$x(1), x(2), \ldots$. The technical assumption that the $a$'s have a 
finite range is not fulfilled; but, as in § 7.7, I proceed with some 
lapse of rigor, referring to Wald's book [W3] or [A7] for the full 
details. Exercise 6.4 shows that attention may be confined to $a$'s that 
are well adapted to at least one $\beta$, and that for those $a$'s it may 
be confined to $f(x_a)$'s that are well adapted to $x_a$ and the 
corresponding $\beta$. The way is paved by § 7.7, which states sharply 
restrictive properties of the $x_a$'s and $f(x_a)$'s that are so adapted. 
In some cases, recognition of these properties contributes greatly to the 
possibility of actually computing minimax, or nearly minimax, pro- 
cedures for sequential problems. 


8 Randomization 


Another important type of compound problem is illustrated by the 
second example of § 9.6. A generalization of part of that example is 
presented here to show how the minimax rule explains, or implies, the 
process called randomization, which is one of the most striking features 
of modern statistics, and one long antedating the minimax rule. Ran- 
domization represents the only important use of mixed acts that has 
thus far found favor with practicing statisticians, as will be discussed 
in the next section. The exact meaning of randomization seems a little 
elusive; no sharp definition is attempted here. But, roughly, 
randomization is the selection of an observation at random; that is, of a 
$\lambda$ with more than one $\lambda(a)$ actually positive, the choice of 
the $\lambda(a)$'s and of the derived acts being governed largely by 
symmetry. The following example provides at least a fairly general 
illustration of the concept. 

To set the stage and provide motivation for a formal statement, the 
example will first be stated in language that is suggestive though a 
little vague. The consequences of the basic acts in the example de- 
pend on the composition of a population of n objects, which may be 
thought of as numbered from 1 through n. It may be known of some 
compositions that they cannot occur; but, if a composition is considered 
possible, all populations having that composition (irrespective of order- 
ing) are also considered possible. Each observation in the compound 
problem consists in the cost-free observation of some m of the objects, 
every subset of exactly m objects being available for observation. 

Formally, the index $i$ of the partition $B_i$ runs over a certain set $I$ 
of $n$-tuples, $\{i_1, \ldots, i_n\}$, of elements considered for 
definiteness to be integers. If $i = \{i_1, \ldots, i_n\} \in I$, then any 
permutation $Ti$ of $i$ is also in $I$. It is assumed that 

(1)   $E(f \mid B_i) = E(f \mid B_{Ti})$ 


for every $f \in F$, $i \in I$, and permutation $T$. 

To every subset $A$ of $m$ integers, 
$1 \le a_1(A) < a_2(A) < \cdots < a_{m-1}(A) < a_m(A) \le n$, there 
corresponds an observation $x(A)$ the possible values of which are 
$m$-tuples $\{x_1(A), \ldots, x_m(A)\}$. The conditional distributions of 
the $x(A)$'s are defined thus: If $x_1(A) = i_{a_1(A)}$, etc., then 
$P(x_1(A), \ldots, x_m(A) \mid B_i) = 1$. 

It is obvious that $L^*(x(A))$ is the same for every $A$. In typical 
applications this common value is little, if at all, less than $L^*$. 

If a compound act $\sum \lambda(A) f(x(A))$ is to be chosen, statistical 
common sense asserts that nothing is to be lost by: 


(a) Letting $\lambda(A)$ be independent of $A$, and therefore equal to 
$\binom{n}{m}^{-1}$ for every $A$; that is, letting every sample of size 
$m$ have the same probability of being chosen, or randomizing, as it is 
said. 

(b) Letting $f(x_1(A), \ldots, x_m(A))$ be symmetric in its $m$ arguments 
and independent of $A$. 


It can in fact be shown, by the method illustrated in the second 
example of § 9.6 and discussed more generally in § 12.5, that there is at 
least one minimax satisfying (a) and (b), and even that there is an 
admissible one. Typically, if $m$ is large, but small compared to $n$, 
$L_\lambda^*$ is much smaller than the common value of the 
$L^*(x(A))$'s. 
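
A very small instance may make the effect of conditions (a) and (b) concrete. Everything in the sketch below is a hypothetical illustration rather than an example from the text: a population of n = 4 objects labeled 0 or 1, known to contain either exactly one or exactly three 1's, a sample of m = 1 object, and loss 1 for wrongly guessing the majority label. It compares the best maximum loss attainable when a fixed object is always observed (searching mixed derived acts on a coarse grid) with the maximum loss of one particular randomized procedure, which draws the object uniformly and guesses the label it sees. The second figure is only an upper bound on the compound minimax value.

```python
from fractions import Fraction as Fr
from itertools import combinations, product
from math import comb

n, m = 4, 1
# All orderings of the admissible compositions (closed under permutation).
populations = [p for p in product((0, 1), repeat=n) if sum(p) in (1, 3)]
subsets = list(combinations(range(n), m))

def majority(pop):
    return 1 if 2 * sum(pop) > n else 0

def loss(pop, A, guess_one):
    """Expected loss when subset A is observed; guess_one[x] is the
    probability of guessing 'majority is 1' after seeing the m-tuple x."""
    x = tuple(pop[j] for j in A)
    p1 = guess_one[x]
    return 1 - p1 if majority(pop) == 1 else p1

def max_loss_fixed(A, guess_one):
    return max(loss(pop, A, guess_one) for pop in populations)

def max_loss_randomized(guess_one):
    lam = Fr(1, comb(n, m))     # condition (a): same weight for every subset
    return max(sum(lam * loss(pop, A, guess_one) for A in subsets)
               for pop in populations)

# Fixed observation of object 0: search mixed derived acts on a grid.
grid = [Fr(k, 10) for k in range(11)]
best_fixed = min(max_loss_fixed((0,), {(0,): p0, (1,): p1})
                 for p0 in grid for p1 in grid)

# Randomized observation, guessing the observed label.
follow = {(0,): Fr(0), (1,): Fr(1)}
randomized = max_loss_randomized(follow)
```

In this toy case the fixed observation does no better than guessing without any observation, while the randomized procedure halves the maximum loss.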

The importance of randomization in applied statistics can scarcely 
be exaggerated. From the personalistic viewpoint it is one of the most 
important ways to bring groups of people into virtual unanimity; from 
the objectivistic viewpoint it not only makes possible great reductions 
in maximum loss, but it is seen as an invention by which the theory of 
probability is brought to bear on situations to which probability on 
first (objectivistic) sight would seem irrelevant.+ 


+ I would express myself very differently today (Savage 1962, pp. 33-34). 


9 Mixed acts in statistics 


Many have commented that modern applied statistics makes one, 
but only one, important use of mixed acts, namely in deciding, through 
the process of randomization, what to observe. Thus, for example, 
once the observation has been made, the derived act is in practice 
almost always chosen, without mixing, from a set of basic acts natural to 
the problem. This might seem to imply a sharp conflict between the 
minimax rule and ordinary statistical practice; but actually it reflects 
agreement, for mixed acts greatly reduce the minimax loss in 
decision-problem interpretations of typical practical statistical 
situations, when and only when ordinary practice calls for mixed acts of 
the same sort, namely when randomization is called for. 

There are certain mechanisms that systematically tend to make mixed 
acts have relatively little, or even absolutely no, advantage over 
unmixed acts. In the following discussion of these mechanisms, let 
$L(r; i)$ be the abstract game on which a bilinear game $L(f; g)$ is based. 

In the first place, supposing that $L(r; i)$ is non-negative for every $r$ 
and $i$ (as is appropriate to the context now at hand), (12.3.6) can be 
completed, so to speak, thus: 


(1)   $L^* \min(R, I) \ge \min_r \max_i L(r; i),$ 


where $R$ and $I$ denote for the moment the number of values of $r$ and 
$i$, respectively, and $\min(R, I)$ is of course the minimum of the two 
integers $R$ and $I$. An inequality stronger than (1) will actually be 
proved. 

Consider a minimax $f$ for which the smallest possible number $R'$ of 
the $f(r)$'s are actually positive: 


(2)   $R'L^* = \max_i R' \sum_r L(r; i) f(r)$ 
      $\ge \max_i L(r'; i)$ 
      $\ge \min_r \max_i L(r; i),$ 

where $r'$ is so chosen that $R'f(r') \ge 1$, as can obviously be done. It 
is known [B19] that $R' \le \min(R, I)$. 

The important lesson of (1) is that, unless $R$ and $I$ are both large, 
the introduction of mixed acts cannot reduce the minimax loss to a 
very small fraction of the value it would otherwise have. 
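
A simple family of games shows that the reduction permitted by (1) can actually be attained. The sketch below is an illustrative assumption, not an example from the text: the loss is 1 when r = i and 0 otherwise, with R = I = n. The minimax value over mixed acts is pinned down without any linear programming, by bounding it above with the uniform mixed act and below with the best pure reply to the uniform beta.

```python
from fractions import Fraction as Fr

def check_bound(n):
    """Return (pure minimax, L*, L* times min(R, I)) for the n-by-n game
    with loss 1 if r == i and 0 otherwise (a hypothetical example)."""
    L = [[1 if r == i else 0 for i in range(n)] for r in range(n)]
    pure_minimax = min(max(row) for row in L)
    u = Fr(1, n)
    # Upper bound on L*: maximum loss of the uniform mixed act.
    upper = max(sum(u * L[r][i] for r in range(n)) for i in range(n))
    # Lower bound on L*: least loss of any act against the uniform beta
    # (by bilinearity it is enough to look at pure acts).
    lower = min(sum(u * L[r][i] for i in range(n)) for r in range(n))
    if upper != lower:
        raise ValueError("bounds do not meet; L* not determined this way")
    return pure_minimax, upper, upper * n
```

Here mixing cuts the minimax loss from 1 to 1/n, which is exactly the factor min(R, I) that (1) allows and no more.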

To mention a different mechanism, Figure 12.4.1 suggests that, if 
there are many $i$'s, the corners of the concave function emphasized in 
that figure may well be very blunt, in which case a minimax mixed act 
has almost as high a maximum loss as any one of its components. When 
the number of $i$'s is infinite, the concave function may well be 
differentiable, in which case mixed acts have absolutely no advantage. 
The remark appended to Exercise 4.5a is pertinent here. 

This mechanism can be related to a certain large class of infinite 
abstract (i.e., not necessarily bilinear) games, discovered by Kakutani 
[K1], for which $L^* = L_*$. Bilinear games are but a special case of 
these, and numerous others seem to arise frequently in applications. 
If $L^* = L_*$ for an abstract game, nothing at all can be gained by 
adjoining mixed acts, as (12.3.5) shows. 

Finally, it may be mentioned that in many cases where an observa- 
tion x might be followed by a mixed derived act, the same, or nearly 
the same, consequences can often be realized by a pure act. Speaking 
a little loosely, this occurs whenever x has a continuous or nearly con- 
tinuous contraction y that is irrelevant, or nearly irrelevant, for then 
y can play the role in selecting a basic derived act that would otherwise 
be assigned to a table of random numbers. If, for example, $x$ is 
continuous, $y(x)$ can be taken as the last few digits in the decimal 
expansion of $x$ to an extravagant number of places. Again if, 
conditionally, $x = \{x_1, \ldots, x_n\}$ is an $n$-tuple of continuously, 
identically, and independently distributed real random variables, $y(x)$ 
may be taken as the permutation that ranks the $x$'s in ascending order, 
provided that $n!$ is fairly large: $10!$ should satisfy almost any need. 
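
The second device, using the ranking of the observations in place of a table of random numbers, can be sketched as follows. The helper names are hypothetical, and the mapping from a permutation to an act label through its Lehmer-code index is one convenient choice among many, not something prescribed by the text. The selection is exactly uniform only when the number of acts divides n!.

```python
from math import factorial

def rank_permutation(xs):
    """Indices of xs arranged so that the observations ascend."""
    if len(set(xs)) != len(xs):
        raise ValueError("ties present; the continuity assumption fails")
    return tuple(sorted(range(len(xs)), key=lambda j: xs[j]))

def permutation_index(perm):
    """Lehmer-code index of a permutation, an integer in 0 .. n! - 1."""
    n = len(perm)
    idx = 0
    for pos, v in enumerate(perm):
        smaller_later = sum(1 for w in perm[pos + 1:] if w < v)
        idx += smaller_later * factorial(n - pos - 1)
    return idx

def choose_act(xs, n_acts):
    """Pick one of n_acts labels from the ranking alone (hypothetical helper)."""
    return permutation_index(rank_permutation(xs)) % n_acts
```

With ten observations there are `factorial(10)` = 3628800 equally likely rankings under the stated assumptions.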

A recent technical reference on the superfluousness of mixed acts in 
the presence of continuous observations is [D13]. 

I have occasionally heard it conjectured that any mixed act made 
after the observation (in an observational decision problem) is wrong in 
principle. I would argue that the conjecture is mistaken thus: Any 
observational problem that calls for randomization can be simulated, so 
far as its loss function $L(r; i)$ is concerned, by a basic problem. A 
mixed act will be as appropriate to the basic problem as it was to the 
observational problem from which the basic one was derived. In this way a 
great variety of situations calling for mixed acts having nothing to do 
with choice of observation can be constructed, though they seem to be 
atypical in practice. Moreover, any basic problem can obviously occur 
as the decision problem remaining after some particular value $x$ of 
an observation has been observed, so the situations just constructed 
lead to closely related ones calling for mixed acts after observation. 

Less abstractly, consider a person choosing from a tray of assorted 
French pastries. Even after extensive visual observation and interro- 
gation of the waiter, the person might justifiably introduce considera- 
ble mixture into his choice. 

I think that the conjecture that mixed acts are necessarily 
inappropriate after observations stems partly from the mechanisms that do 
tend to make such acts inappropriate or unimportant in many typical 
cases and partly from justifiable dissatisfaction with specific mixed acts 
that have from time to time been suggested by statisticians. For ex- 
ample, the suggestion that ties in rank arising in non-parametric tests 
be removed by ranking the tied observations at random may in many, 
or perhaps all, cases fairly be regarded with suspicion. 


CHAPTER 15 


Point Estimation 


1 Introduction 


This chapter discusses point estimation, and the next two discuss the 
testing of hypotheses and interval estimation, respectively. Definitions 
of these processes must be sought in due course; but, for the moment, 
whatever notions about them you happen to have will afford sufficient 
background for certain introductory remarks applying equally well to 
both kinds of estimation and to testing. 

Estimating and testing have been, and inertia alone would insure 
that they will long continue to be, cornerstones of practical statistics. 
Their development has until recently been almost exclusively in the 
verbalistic tradition, or outlook. For example, testing and interval 
estimation have often been expressed as problems of making assertions, 
on the basis of evidence, according to systems that lead, with high prob- 
ability, to true assertions, and point estimation has even been decried 
as ill-conceived because it is not so expressible. 

Wald’s minimax theory has, as was explained in § 9.2, stimulated in- 
terest in the interpretation of problems of estimation and testing in be- 
havioralistic terms; to objectivists this has, of course, meant interpre- 
tation as objectivistic decision problems. For reasons discussed in 
§ 9.2, it does seem to me that any verbalistic concept in statistics owes 
whatever value it may have to the possibility of one or more behavioral- 
istic interpretations. 

The task of any such interpretation from one framework of ideas to 
another is necessarily delicate. In the present instance, there is a par- 
ticular temptation to force the interpretation, namely, so that criteria 
proposed by the verbalistic outlook are translated into applications of 
the minimax theory, that is, of the minimax rule and the sure-thing 
principle (as expressed by the criterion of admissibility), for these are 
the only general criteria thus far proposed and seriously maintained 
for the solution of objectivistic decision problems. Of course it is to 
be expected, and I hope later sections of this chapter and the next 
demonstrate, that unforced interpretations do often translate verbalistic 
criteria into applications of the behavioralistic ones. In evaluating any 
such interpretations, it must be borne in mind that an analogy of great 
mathematical value may be valueless as an interpretation; correspond- 
ingly, what is put forward as mere analogy should not be taken to be 
an interpretation, much less branded as a forced one. For example, 
attention has already been called (in § 11.4) to the danger of regarding 
the analogy between the theory of two-person games and that of the 
minimax rule for objectivistic decision problems as an interpretation. 
In fact, minimax problems are of such mathematical generality that 
they arise, even within statistics, in contexts other than direct applica- 
tion of the minimax rule to objectivistic decision problems; a striking, 
though technical, example is Theorem 2.26 of Wald’s book [W3]. 

The literature of estimation and testing is vast; indeed it has, I 
think, been seriously contended that statistics treats of no other sub- 
jects. This chapter and the next two cannot, therefore, pretend to 
present a complete digest of that literature, even so far as it pertains to 
the foundations of statistics. For further reading certain chapters of 
Kendall’s treatise [K2] may be recommended as a key reference to the 
verbalistic tradition (Chapters 17 and 18 for point estimation; 19 and 
20 for interval estimation; 21, 26, and 27 for testing). Many newer 
aspects are treated in Wald’s book [W3]; and a recent review of testing 
by Lehmann [L4] is recommended. 


2 The verbalistic concept of point estimation 


Abstractly and very generally, but in verbalistic language (which is 
necessarily vague), the problem of point estimation is this: Knowing 
$P(x \mid B_i)$ for every $i$ and having observed the value $x$, guess the 
value $\lambda$ of a prescribed function, or parameter as it is often 
called, $\lambda(i)$ with values in a set $\Lambda$. 
Semi-behavioralistically this is, I think universally, understood to mean 
that a function $l$ associating a value $l(x) \in \Lambda$ with each $x$ 
(or possibly a mixture of such functions) is to be decided on, the 
function $l$ being called an estimate (or, to be complete, a point 
estimate) of the parameter $\lambda$. A problem of point estimation has, 
thus, some of the structure of an objectivistic observational problem; 
but, since nothing has yet been said about the income, or consequence, 
resulting from the act $l$ in case $B_i$ obtains, it is at the moment 
impossible to advance criteria for the choice of $l$. 


3 Examples of problems of point estimation 


It will now be well to present some examples after a few words of 
preparation. For simplicity, $\Lambda$ will henceforth generally be 
supposed to be an interval (possibly unbounded) of real numbers. If 
$\lambda(i) = \lambda(i')$ implies $i = i'$, then $\lambda$ rather than 
$i$ can be used to index the partition; such an estimation problem is said 
to be free of nuisance parameters. This usage corresponds to the fact 
that the $i$'s can typically be represented as ordered couples 
$(\lambda, \theta)$, where $\lambda$ is of course $\lambda(i)$ and 
$\theta$ is called the nuisance parameter; if $\theta$ in turn happens to 
be represented as an ordered $n$-tuple, ordinary usage calls $\theta$ an 
$n$-tuple of nuisance parameters. It must be recognized as atypical in 
estimation problems for $i$ or $\lambda$ to be confined to a finite set of 
values, and often $x$ is not so confined either. It will therefore be 
necessary to proceed heuristically into domains where the mathematically 
limited theory developed in this book does not rigorously apply. 

The specific estimation problems most commonly cited as examples, 
and most important in practice, are summarized in Table 1, together 
with their maximum-likelihood estimates, that is, estimates constructed 
in accordance with a rule to be defined in § 4. All but the last two ex- 
amples of Table 1 are free of nuisance parameters. 


4 Criteria that have been proposed for point estimates 


As a matter of fact, verbalistic treatments typically do give some 
inkling of the consequence of the act $l$ when $B_i$ obtains. Thus, in the 
examples commonly cited, such as those in Table 3.1, $\Lambda$ is a set of 
real numbers or a set of $n$-tuples of real numbers and, therefore, a set 
of objects between which the notion of proximity has some meaning. 
Work in the verbalistic tradition has made it clear in connection with 
such examples that, if $l = \lambda(i)$ for the $B_i$ that obtains, the 
guess is considered perfect and that, roughly speaking, it is considered 
rather poor if $l$ is far from $\lambda$. 

In spite of the apparently hopeless indefiniteness of estimation prob- 
lems even as thus formulated, various criteria, or desiderata, for esti- 
mates have been suggested. A list of these criteria, intended to be es- 
sentially complete, is now presented. Each item is annotated and il- 
lustrated to make its meaning clear, and sometimes to call attention 
to related criteria not explicitly listed; motivation and criticism are, 
however, deferred until later sections, where they are treated in 
connection with explicit hypotheses about the consequences of 
misestimation. 

No attempt is made to include criteria like intellectual simplicity or 
facility of computation that depend not only on the estimate but also 
on the capabilities of the people who contemplate using it. The list 
is in a sense logically inhomogeneous. For example, no one really 
considers it a virtue in itself for an estimate to be a 
maximum-likelihood estimate (Criterion 4); rather, it is believed that 
such estimates do typically have real virtues. 

It has, to begin the list of criteria, been suggested by one person or 
another that: 


1. If $y$ is sufficient, nothing is to be lost by requiring the estimate 
$l$ to be a contraction of $y$. 


It will be instructive to bear in mind that necessary and sufficient 
statistics of the examples (a)-(f) in Table 3.1 are, respectively, z, z, 


i 2 Gy fy a 
2. If, of two estimates $l$ and $l'$, 

(1)   $E([l - \lambda(i)]^2 \mid B_i) \le E([l' - \lambda(i)]^2 \mid B_i)$ 

for every $i$, with strict inequality for some $i$, then $l$ is better 
than $l'$. 


There are countless variants of this idea. In particular, the square 
of the difference may be replaced by any other positive power of the 
absolute difference. Again, (1) may be imposed at only one value of $i$, 
if $l$ and $l'$ are subjected to some other condition, freedom from bias 
(Criterion 6 below) being the popular one. 

Example (f) gives rise to a good illustration of this criterion, which 
is also interesting in a later connection. Letting 
$Q =_{\mathrm{Df}} \sum_r x_r^2 - n\bar{x}^2$, it is well known that 
$E(Q \mid \mu, \sigma^2) = (n - 1)\sigma^2$ and that 
$E(Q^2 \mid \mu, \sigma^2) = (n^2 - 1)\sigma^4$. Therefore 


(2)   $E([aQ - \sigma^2]^2 \mid \mu, \sigma^2) = \{a^2(n^2 - 1) - 2a(n - 1) + 1\}\sigma^4$ 
      $= \left\{(n^2 - 1)\left(a - \frac{1}{n + 1}\right)^2 + \frac{2}{n + 1}\right\}\sigma^4$ 
      $\ge \frac{2\sigma^4}{n + 1}$ 


for all real $a$, with equality if and only if $a = (n + 1)^{-1}$, 
omitting the pathological but trivial case that $n = 1$. By the criterion 
in question, $Q/(n + 1)$ is therefore better than any other estimate of 
the form $aQ$, including the maximum-likelihood estimate $Q/n$ and the 
unbiased estimate $Q/(n - 1)$. 
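
The comparison among the three multiples of Q can be made concrete by evaluating the coefficient of the fourth power of sigma in (2), read here as a^2(n^2 - 1) - 2a(n - 1) + 1. The sketch below does this exactly for one sample size; the choice n = 5 and the names used are illustrative only.

```python
from fractions import Fraction as Fr

def mse_coefficient(a, n):
    """Coefficient of sigma^4 in E([aQ - sigma^2]^2), as read from (2)."""
    return a * a * (n * n - 1) - 2 * a * (n - 1) + 1

n = 5   # illustrative sample size
candidates = {
    "Q/(n+1)": Fr(1, n + 1),   # minimizes mean squared error among aQ
    "Q/n": Fr(1, n),           # maximum-likelihood estimate
    "Q/(n-1)": Fr(1, n - 1),   # unbiased estimate
}
coeffs = {name: mse_coefficient(a, n) for name, a in candidates.items()}
```

For general n the three coefficients work out to 2/(n + 1), (2n - 1)/n^2, and 2/(n - 1), so the ordering seen at n = 5 holds for every n greater than 1.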


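3. If, of two estimates $l$ and $l'$, 

(3)   $P(-\epsilon_1 \le l(x) - \lambda(i) \le \epsilon_2 \mid B_i) \ge P(-\epsilon_1 \le l'(x) - \lambda(i) \le \epsilon_2 \mid B_i)$ 


for every non-negative $\epsilon_1$ and $\epsilon_2$ and for every $i$, 
with strict inequality for some $\epsilon_1$, $\epsilon_2$, and some $i$, 
then $l$ is better than $l'$. 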


+ This example was given by Leo A. Goodman (1953). 

Acceptance of this criterion is obviously implied by acceptance of 
Criterion 2, of which it may therefore be regarded as a skeptical 
counterpart; formal demonstration of a much more general assertion will be 
given in connection with (5.2-4). The criterion implies, for example, in 
connection with (c) of Table 3.1 that $\bar{x}$ is superior to any other 
weighted average of the $x_r$'s. A more interesting example will be 
mentioned in connection with Criterion 5. 

That modification of Criterion 3 in which it is concluded only that 
$l$ is at least as good as $l'$ is of some technical interest. Incidentally, if 
equality held identically in (3), there would presumably be nothing to 
choose between the two estimates by any reasonable criterion, for they 
would then both have the same system of conditional distributions. 


4. A maximum-likelihood estimate is often a rather good estimate. 


A maximum-likelihood estimate is an estimate $l$ such that, for some 
function $i$ of $x$, $l(x) = \lambda(i(x))$ and 


(4)   $P(x \mid B_{i(x)}) \ge P(x \mid B_i)$ 


for every $i$ and $x$. In many natural problems there is only one 
maximum-likelihood estimate. Taking into account the analogy between 
probabilities and values of probability densities, the reader should 
verify that the estimates listed in Table 3.1 are indeed the unique 
maximum-likelihood estimates of the problems to which they refer. When 
there is a unique maximum-likelihood estimate, it is obviously a 
contraction of the likelihood ratios and, therefore, of any sufficient 
statistic; which fits neatly with Criterion 1. 
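
For a finite problem the definition can be carried out by direct search. The following sketch uses a made-up table of conditional probabilities and made-up parameter values, none of which come from the text; it picks, for each x, the index i(x) that maximizes P(x | B_i) and reports the corresponding parameter value.

```python
from fractions import Fraction as Fr

# Hypothetical finite problem: P[i][x] stands for P(x | B_i).
P = {
    "i1": {"x1": Fr(6, 10), "x2": Fr(3, 10), "x3": Fr(1, 10)},
    "i2": {"x1": Fr(2, 10), "x2": Fr(5, 10), "x3": Fr(3, 10)},
    "i3": {"x1": Fr(1, 10), "x2": Fr(2, 10), "x3": Fr(7, 10)},
}
lam = {"i1": 0.0, "i2": 1.0, "i3": 2.0}   # hypothetical values of lambda(i)

def i_hat(x):
    """An index i(x) maximizing P(x | B_i); ties would go to the first found."""
    return max(P, key=lambda i: P[i][x])

def ml_estimate(x):
    """l(x) = lambda(i(x))."""
    return lam[i_hat(x)]
```

In this table every column has a single largest entry, so the estimate is unique; with ties, `max` would silently pick one maximizer, and uniqueness would have to be checked separately.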


5. A good estimate should have the same symmetry as the problem. 
More precisely, if a permutation $T$ of the $i$'s and the $x$'s is such 
that 

(5)   $P(Tx \mid B_{Ti}) = P(x \mid B_i),$ 


and such that $\lambda(i) = \lambda(i')$ implies 
$\lambda(Ti) = \lambda(Ti')$; then $l$ should be such that, if 
$l(x) = \lambda(i)$, $l(Tx) = \lambda(Ti)$. 

For example, adopting also Criterion 1, a good estimate for $\mu$ in (c) 
may be sought of the form $l(\bar{x})$. Symmetry then dictates 
$l(\bar{x} + a) = l(\bar{x}) + a$ and $l(-\bar{x}) = -l(\bar{x})$; in 
short, $l(\bar{x}) = \bar{x}$. 

The same conclusion can be drawn for (e), though with a little more 
trouble. The criterion applied to (f) leads to estimates of the form $aQ$. 
The constant $a$ might be fixed by appealing, for example, to Criterion 
2, 4, or 6. These alone give three slightly different determinations: 
$a^{-1} = (n + 1)$, $n$, and $(n - 1)$, respectively. 

Again, it can be shown for Examples (c) and (e) that, among all 
estimates satisfying Criterion 5, $\bar{x}$ is best according to 
Criterion 3. 


6. It is desirable that the estimate be unbiased. 
An estimate $l$ is called unbiased, if and only if 

(6)   $E(l \mid B_i) = \lambda(i)$ 


for every $i$. 

It is easy to verify that the maximum-likelihood estimates of (a)-(e) 
in Table 3.1 are all unbiased; that of (f), however, is not, for 
$E(Q/n \mid \mu, \sigma^2) = (1 - 1/n)\sigma^2$ instead of $\sigma^2$. 
Again, if $l$ is a maximum-likelihood estimate of $\lambda$, $e^l$ is a 
maximum-likelihood estimate of $e^\lambda$. But, if $l$ is not definitive, 
and $l$ is an unbiased estimate of $\lambda$, $e^l$ is not an unbiased 
estimate of $e^\lambda$, as Theorem 1 of Appendix 2 implies. 


7. If P1—@®)| < | — x) || By > 1/2 for every ¢, then 1 is 
better than I’. 


Any resemblance between this criterion and Criterion 3 seems to be
dispelled by the following example. Suppose that, for every i,
P(l − λ(i) = a, l′ − λ(i) = b | B_i) equals 2/11 if a and b are integers
such that 0 ≤ a < b ≤ 2, equals 5/11 if a and b are 2 and 0 respectively,
and equals 0 otherwise. According to Criterion 7, l is better than l′,
because 6/11 > 1/2; but, according to Criterion 3, l′ is better than l,
because 5/11 > 4/11 and 7/11 > 6/11. The example can easily be
modified to suit any taste for symmetry and continuity. But, if l and
l′ are conditionally independent (which is not a natural assumption),
and l is better than l′ according to Criterion 7; then, as may easily be
shown, l′ cannot be better than l by Criterion 3.
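The arithmetic of this example is easy to reproduce. The following sketch tabulates the joint distribution of the two errors as stated above and recomputes the probabilities quoted for Criteria 7 and 3; the variable and function names are illustrative only.

```python
from fractions import Fraction

# Joint distribution of the errors (a, b) = (l - lambda, l' - lambda).
joint = {
    (0, 1): Fraction(2, 11),
    (0, 2): Fraction(2, 11),
    (1, 2): Fraction(2, 11),
    (2, 0): Fraction(5, 11),
}

# Criterion 7: probability that l is strictly closer to lambda than l'.
p_l_closer = sum(p for (a, b), p in joint.items() if abs(a) < abs(b))

def prob_error_at_most(c, which):
    """P(|error| <= c) for l (which = 0) or l' (which = 1)."""
    return sum(p for errs, p in joint.items() if abs(errs[which]) <= c)

# Criterion 3 compares these cumulative probabilities for every c.
cdf_l = [prob_error_at_most(c, 0) for c in (0, 1, 2)]
cdf_l_prime = [prob_error_at_most(c, 1) for c in (0, 1, 2)]

print(p_l_closer)      # probability used by Criterion 7
print(cdf_l)           # for l
print(cdf_l_prime)     # for l'
```

The two estimates are strongly dependent here, which is what allows the two criteria to disagree.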


The list of criteria is here interrupted by several paragraphs of ex- 
planation in preparation for two concluding criteria. 

The approach to certainty treated in §§ 3.6 and 7.6 has its
counterpart in the theory of estimation. In particular, if
x(n) = {x_1, ..., x_n} is an n-tuple of conditionally independent and
identically distributed observations, there will typically exist
sequences of estimates l(n) based on x(n), such that

(7)    $\lim_{n \to \infty} P(|l(x(n); n) - \lambda(i)| < \epsilon \mid B_i) = 1$

for every positive ε and every i. A sequence of estimates satisfying (7)
relative to any sequence of observations x(n) (not necessarily n-tuples
of conditionally independent observations) is called consistent.

The condition of consistency is often realized in a very special way,
namely that the error [l(x(n); n) − λ(i)] is, for every B_i and for large
n, practically normally distributed about zero with variance inversely
proportional to n. More formally, a sequence of estimates may be
such that

(8)    $\lim_{n \to \infty} P\left( \frac{n^{1/2}[l(x(n); n) - \lambda(i)]}{\sigma(i)} \le a \,\Big|\, B_i \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-z^2/2} \, dz$

for every i and a, where σ(i) is some positive function of i; it is then
said that n^{1/2}[l(x(n); n) − λ(i)] is asymptotically normal about zero
with asymptotic variance σ²(i). If, in addition, for every i, σ^{-2}(i)
is not less than a certain function, the differential information, to be
defined in § 6, then the sequence l(n) is called efficient.

There is a possible pitfall in connection with the idea of asymptotic
normality. Though (8) implies that, for large n, the distribution of
the error is, in a sense, almost the normal distribution with zero mean
and variance σ²(i)/n, it does not imply that the mean of the error is
close to zero, or even finite or well defined. Similarly, the variance of
the error may be much larger than σ²(i)/n, infinite, or ill defined; but
it cannot, for large n, be smaller than σ²(i)/n by a fixed fraction or less.

Much literature on estimation has concentrated on sequences of
estimation problems in which x(n) is an n-tuple consisting of the first n
elements of an infinite sequence of conditionally independent and
conditionally identically distributed random variables or, as it will be
called in the present chapter, a standard sequence; because these are
the simplest examples of sequences of increasingly informative
observations. Examples (c)-(f) in Table 3.1 refer directly to standard
sequences; the binomial distributions (a) can be regarded as the
distribution of the sufficient statistic Σ x_i of the standard sequence
x(n) in which each x_i takes the values 1 and 0 with probabilities p and
1 − p, respectively (cf. Exercise 7.4.1); again, if each x_i is
Poisson-distributed with parameter μ, then Σ x_i is sufficient for x(n)
and is itself Poisson-distributed with parameter nμ. Thus, all the
examples in Table 3.1 give rise more or less directly to examples of
standard sequences.

In speaking of standard, and occasionally of other, sequences the
ellipsis of referring to a sequence of estimates simply as "an estimate"
has been widely adopted, so one reads recommendations that "an
estimate" should be consistent or efficient. This ellipsis, though often
convenient, sometimes proves dangerous. It distracts from the fact
that a person is called upon to make an estimate, not a sequence of
estimates; so that the question of what constitutes a good sequence does
not arise. Again, it makes one feel that if an estimate, say l_13, has been
defined for x(13), then the definition of l_14 is thereby implied. One
forgets, for example, that "the" average of n observations is a whole
sequence of statistics, a sequence singled out by human tastes and
interests, rather than by any mathematical necessity. In short, the
ellipsis establishes the atmosphere of the logically nonsensical (though
perhaps psychologically revealing) questions on intelligence tests such as:
"What are the two missing terms in the sequence __ __ 1828 1828?" †

The recommendations of consistency and efficiency quoted above can 
be added to the numbered list of suggestions, in a form that avoids the 
ellipsis: 


8. If each l(n) is a good estimate for the corresponding x(n) of a
standard sequence, then the sequence l(n) is consistent.


The sequences of maximum-likelihood estimates for the sequences of
problems (a), (c)-(f) are consistent; and, for the sequence of problems
of estimating μ from an observation y_n, Poisson-distributed with
parameter nμ, the maximum-likelihood estimates y_n/n are consistent.

If there is one consistent sequence of estimates for a sequence of
problems, there is a plethora. Each term of a consistent sequence can,
for example, be multiplied by (1 + n^{-1/2}) without destroying
consistency. Again, the sample medians‡ are, in (c), a consistent sequence
different from the sequence of maximum-likelihood estimates.


9. Under the hypothesis of Criterion 8, the sequence l(n) is efficient,
at least if any efficient sequence of estimates exists.


The six sequences of maximum-likelihood estimates mentioned under
Criterion 8 are all well known to be efficient, as sequences of
maximum-likelihood estimates for standard sequences typically are. The
asymptotic variances and certain other interesting quantities associated
with these six sequences are presented in Table 1. It is remarkable that,
for each of the examples in Table 1, the expected values of the estimates
approach the estimated parameter; n times the variance of the estimate,
and n times the expected squared error, both approach the asymptotic
variance of n^{1/2} times the error. For the first five examples the
relations mentioned hold, indeed, not only in the limit, but exactly,
for all n. All six examples are rather special, or magical, but the
limiting relations just mentioned may fairly be expected to hold in some
generality, though they are not (as has already been mentioned) really
implied by the asymptotic normality of the sequence of errors times
n^{1/2}. To illustrate the exceptions that can occur, |x̄|^{-1} is, in (c),
the maximum-likelihood estimate of |μ|^{-1} for μ ≠ 0; this sequence of
estimates is efficient; and n^{1/2}(|x̄|^{-1} − |μ|^{-1}) is asymptotically
normal about zero with asymptotic variance μ^{-4}; but the other three
entries for Table 1 are infinite in this example.

† e = 2.7182818285 to eleven significant figures.
‡ See any statistics text for definition, if necessary.


TABLE 1. EXAMPLES OF BEHAVIOR OF MAXIMUM-LIKELIHOOD ESTIMATES

                                            n × expected    Asymptotic
Sequence   Mean           n × variance      square of       variance of
                                            error           n^{1/2} × error

(a)        p              pq                pq              pq
Poisson    μ              μ                 μ               μ
(c)        μ              1                 1               1
(d)        σ²             2σ⁴               2σ⁴             2σ⁴
(e)        μ              σ²                σ²              σ²
(f)        (1 − 1/n)σ²    2(1 − 1/n)σ⁴      (2 − 1/n)σ⁴     2σ⁴


As in the case of consistency, where there is one efficient sequence,
there are many, but efficiency is, of course, a much more restrictive
property than consistency. For example, multiplication by (1 + n^{-1/2})
typically destroys efficiency, though multiplication by (1 + n^{-1}) never
does. Again, the consistent sequence of medians mentioned under
Criterion 8 is not efficient. Indeed, it is well known of that sequence that
the sequence of errors times n^{1/2} is asymptotically normal about zero
with asymptotic variance π/2 rather than 1.
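A small simulation gives a feeling for the contrast between the mean and the median in Example (c). The sketch below is an illustration under stated assumptions: normal observations with unit variance, a fixed sample size n = 101, a fixed seed, and a modest number of replications, so the printed figures are only rough approximations to the limiting values 1 and π/2.

```python
import math
import random
import statistics

def scaled_variances(n=101, reps=4000, seed=0):
    """Return (n * var(sample mean), n * var(sample median)) for N(0, 1) samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return n * statistics.pvariance(means), n * statistics.pvariance(medians)

mean_scaled, median_scaled = scaled_variances()
print("n x variance of mean:  ", mean_scaled)
print("n x variance of median:", median_scaled)
print("pi / 2:                ", math.pi / 2)
```

Being a Monte Carlo estimate at a finite n, the output should be read as suggestive of the limiting behavior, not as a verification of it.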


5 A behavioralistic review of the criteria for point estimation 


It is time now to introduce the notion of consequences, or
(equivalently, I believe) of loss, thereby interpreting estimation problems
as decision problems. Let it be said then that an estimation decision
problem is an observational decision problem with the following
distinguishing feature. There is a one-to-one correspondence between the
basic acts f and the values attained by a real-valued function λ(i), such
that L(f; i) = 0, if f is the act that corresponds with λ(i). It is simpler,
more suggestive, and harmless to let the number l that corresponds to
f replace f itself in all further discussion of estimation decision problems.
To illustrate the new notation, it may be said that L(l; i) = 0, if l = λ(i).

I believe that any situation ordinarily said to call for (point)
estimation can be analyzed as an estimation decision problem. For example,
estimating how much paint will cover a wall may, depending on
circumstances, mean deciding: how much paint to buy, what to bid for a
contract, or what number to enter in a guessing pool. Under each of
those interpretations there will be zero loss, if and, typically, only if
the estimate is "correct," as one says.

The consequences of an estimate may, like those of many real life 
decisions, be difficult to appraise. It is hard to say even in relatively 
concrete situations what it will cost to misestimate the speed of light, 
a particular mortality rate, or the national income. If, to revert to an 
example already discussed, the estimate is to be published somewhere 
for the use of whoever has a use for it, the consequences of publication 
may seem beyond all reckoning. None the less, I reaffirm the conviction
that the concept of consequence measured in income or loss is
valuable in dealing with such situations, as I hope the present
treatment of estimation will illustrate.+ Incidentally, it seems indifferent,
as I have already said, whether loss or income is taken as the starting 
point. It is easily shown that the decisions of the idealized person of 
the personalistic probability theory will be the same in two problems 
having possibly different income, but the same loss, functions. This 
feature I would expect to be acceptable even to objectivists, and I 
also think it appropriate to theories of group decision. 

I know of nothing interesting that distinguishes estimation decision 
problems as a class from observational decision problems generally. 
But actual estimation situations suggest certain relatively wide classes 
of estimation decision problems about which interesting and valuable 
conclusions can be drawn. Indeed, it will be shown in this and the next 
two sections that seven of the nine listed criteria for estimation can be 
justified to some extent as flowing from application of the principle of 
admissibility and the minimax rule to such classes of estimation de- 
cision problems. 

Before making any real specialization, it may be most systematic to 
mention that Criterion 1 is simply an instance of the general principle, 
which we have now studied from several points of view, that nothing 
is lost by confining attention to sufficient statistics, at least if mixtures 
are allowed. 

It is clear in almost any estimation situation, even in those for which
the notion of loss is vaguest, that if two errors have the same sign the
larger entails at least as great a loss as the smaller. Analytically,

(1)    $L(l; i) \le L(l'; i)$

for λ(i) ≤ l ≤ l′ and for λ(i) ≥ l ≥ l′. Situations to which (1) fails
to apply can readily be imagined. William Tell, for example, in
estimating the angle by which to elevate his cross-bow for the apple shot
might have preferred a downward error of 10° to one of 1°; but such
circumstances seem exceptional. Furthermore, it is usually justifiable
to assume that strict inequality holds in (1), though there are many
exceptions in which, for example, "a miss is as good as a mile" or one
hit is as good as another.

+ This idea was expressed by Gauss (1821, Section 6).

As is, I think, intuitively evident, when strict inequality holds in 
(1), Criterion 3 is simply an application of the principle of admissibility. 
That conclusion can be shown in complete generality without serious 
difficulty, but, in compliance with the usual mathematical limitations 
of this book, it will here be shown only under the assumption that x 
is confined to a finite number of values. 

What is to be shown is this: If l and l′ are a pair of estimates
satisfying the hypothesis of Criterion 3, and if (1) holds with strict
inequality; then the expected losses satisfy L(l; i) − L(l′; i) ≤ 0 for
every i, with strict inequality for some i. To begin the proof calculate
thus:

(2)    $L(\mathbf{l}; i) - L(\mathbf{l}'; i) = \sum_{l} L(l; i)\,[P(\mathbf{l}(x) = l \mid B_i) - P(\mathbf{l}'(x) = l \mid B_i)]$
       $= \sum_{l} L(l; i)\,Q(l; i)$
       $= \sum_{l < \lambda(i)} L(l; i)\,Q(l; i) + \sum_{l > \lambda(i)} L(l; i)\,Q(l; i),$

where the definition of Q(l; i) is clear from the context, and where it
has been taken into account that L(λ(i); i) = 0. It will be shown that
both sums in the last part of (2) are non-positive and that for some i at
least one of them is negative. Focus, for definiteness, on the second
sum. Let l_0 = λ(i) and l_1, l_2, ... be, in order of increasing magnitude,
the values of l > λ(i) for which Q(l; i) ≠ 0. With the abbreviations
L(k) =_Df L(l_k; i), Δ(k) =_Df L(k) − L(k − 1), and Q(k) =_Df Q(l_k; i),
the sum to be investigated is

(3)    $\sum_{0<k} L(k)\,Q(k) = \sum_{0<k} Q(k) \sum_{0<k' \le k} \Delta(k') = \sum_{0<k'} \Delta(k') \sum_{k \ge k'} Q(k).$

(This rearrangement may seem bizarre on first encounter, but it is
widely used in mathematics generally and is in fact an exact analogue,
for sums, of the more familiar integration by parts, for integrals.) It
follows from (1) read with strict inequality that Δ(k) > 0; and it
follows from the hypothesis of Criterion 3 that each inner sum of Q(k)
over k ≥ k′ is non-positive, and that some such sum (or an analogous
term associated with the first sum in the last line of (2)) is strictly
negative for some i. This completes the deduction of Criterion 3 from
the strict form of (1) and the principle of admissibility. Essentially the
same argument leads from (1) as actually written to the modification
mentioned in the note under Criterion 3.
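The rearrangement (3) can be checked on a small numerical instance. In the sketch below the losses L(k) and the probability differences Q(k) are hypothetical numbers chosen for illustration; they are picked so that one individual Q(k) is positive while every tail sum of the Q(k) is non-positive, which is all the argument uses.

```python
from fractions import Fraction

def abel_rearrangement(L, Q):
    """Evaluate both sides of (3); L(k), Q(k) for k = 1..m sit at positions 0..m-1, L(0) = 0."""
    m = len(L)
    delta = [L[0]] + [L[k] - L[k - 1] for k in range(1, m)]   # Delta(k) = L(k) - L(k - 1)
    tails = [sum(Q[k:]) for k in range(m)]                    # sum of Q(k) over k >= k'
    direct = sum(L[k] * Q[k] for k in range(m))
    rearranged = sum(delta[k] * tails[k] for k in range(m))
    return direct, rearranged, delta, tails

# Hypothetical example: losses strictly increasing with the size of the error.
L = [Fraction(1), Fraction(3), Fraction(6)]
p_l = [Fraction(3, 10), Fraction(1, 10), Fraction(0)]         # P(l = l_k | B_i)
p_l_prime = [Fraction(2, 10), Fraction(2, 10), Fraction(1, 10)]
Q = [a - b for a, b in zip(p_l, p_l_prime)]

direct, rearranged, delta, tails = abel_rearrangement(L, Q)
print(direct, rearranged)   # the two sides of (3)
print(delta, tails)         # positive increments, non-positive tail sums
```

Since every Δ(k′) is positive and every tail sum is non-positive, the rearranged form makes the sign of the whole sum evident at a glance.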

A very slight strengthening of (1), together with the minimax rule,
provides a widely applicable justification of Criterion 8 (consistency),
as will now be explained. Suppose that (1) not only holds but also is
strict, if l = λ(i); that is, in addition to (1) suppose only that L(l′; i)
> 0 for all l′ ≠ λ(i). In this context, let x(n) be a sequence of
observations such that the minimax L*(n) of the corresponding estimation
problems approaches zero with increasing n; then any sequence of
minimax estimates l(n) is consistent. Indeed, if the sequence l(n) is not
consistent, then, for some i, and some positive ε and δ,

(4)    $P(|l(x(n); n) - \lambda(i)| \ge \epsilon \mid B_i) \ge \delta$

for some arbitrarily large values of n. This implies

(5)    $L^*(n) \ge L(l(n); i) \ge \delta \min\{L(\lambda(i) + \epsilon; i),\ L(\lambda(i) - \epsilon; i)\} > 0,$

which contradicts the hypothesis.

Turn next to Criterion 5 (symmetry). Suppose that the estimation
decision problem has symmetry in the sense defined under Criterion 5.
That does not in itself really call for estimates with the same symmetry.
But, if L also has the symmetry, that is, if L(λ(i′); i) = L(λ(Ti′); Ti)
for all appropriate i′, then the discussion of symmetry in § 12.5
suggests that typically there is, at any rate, a symmetrical, admissible,
minimax estimate. Whether L has the requisite symmetry is a question
that can often be answered without detailed knowledge of L.

It is often justifiable to suppose that the function L(l; i) is smooth
enough to be differentiated twice with respect to l, at least when l is
near λ(i). This condition, though very often met, is not quite so
devoid of content as it may seem to a reader brought up in the tradition
that it makes no practical difference whether a function has a few sharp
corners because they can always be rounded off with almost no change
in the function. If, for example, L(l; i) is for all practicable purposes
equal to |l − λ|; then L cannot be regarded as differentiable even
once when l = λ, and the theory to be developed here for twice
differentiable L(l; i)'s in the presence of extensive observation does not
apply. It will therefore be useful to digress to the consideration of an
example, illustrating how corners can arise and the phenomena that tend
to round them off.

Suppose that a person must estimate the amount λ of shelving for
books, priced at $1.00 per foot, to be ordered for some purpose. It is
possible that the following economic analysis of the situation would be
sufficiently realistic. The person holds every foot of shelving less than
the number of feet, λ, of books to be shelved to be worth $α, α > 1,
but superfluous shelving he holds to be worthless. Formally,

(6)    $L(l; \lambda) = (\alpha - 1)(\lambda - l)$   for l ≤ λ
       $L(l; \lambda) = (l - \lambda)$               for l ≥ λ.

There is then a corner, or kink, at l = λ; so differentiation, even once, is
impossible.

But the following analysis is much more likely to be sufficiently
realistic. The urgency of the shelving of the books is variable. Some would
be worth shelving, even if the cost of shelving were very high; at the
other extreme, there are some that would not be worth shelving unless
the cost were very low. More fully, the value of l feet of shelving is a
function i(l) that presumably has the following features. It is
monotonically increasing, strictly concave, and twice differentiable in l;
i(0) = 0; i(∞) < ∞; i′(0) > 1. The income attached to ordering l
feet of shelving, at the price $1.00 per foot, is clearly

(7)    $I(l; \lambda) = i(l) - l.$

It is maximized at the one and only value λ for which di(λ)/dλ = 1, so
that

(8)    $L(l; \lambda) = [i(\lambda) - \lambda] - [i(l) - l],$

which is of course twice differentiable in l.
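The two analyses of the shelving example can be compared numerically. The sketch below is illustrative only: the price ratio α = 3 and the value function i(l) = 4(1 − e^{−l/2}) are hypothetical choices that merely satisfy the stated requirements (increasing, strictly concave, i(0) = 0, bounded, i′(0) = 2 > 1). One-sided difference quotients exhibit the kink of (6) and the smoothness of (8).

```python
import math

ALPHA = 3.0  # hypothetical worth, in dollars, of a needed foot of shelving (alpha > 1)

def kinked_loss(l, lam, alpha=ALPHA):
    """Loss (6): shortage costs (alpha - 1) per foot, surplus costs 1 per foot."""
    return (alpha - 1.0) * (lam - l) if l <= lam else (l - lam)

def value(l):
    """Hypothetical value function i(l) = 4 (1 - exp(-l / 2))."""
    return 4.0 * (1.0 - math.exp(-l / 2.0))

LAM = 2.0 * math.log(2.0)  # solves i'(lam) = 2 exp(-lam / 2) = 1

def smooth_loss(l, lam=LAM):
    """Loss (8): best attainable income minus the income from ordering l feet."""
    return (value(lam) - lam) - (value(l) - l)

def one_sided_slopes(f, x, h=1e-6):
    """Left and right difference quotients of f at x."""
    return (f(x) - f(x - h)) / h, (f(x + h) - f(x)) / h

kink_left, kink_right = one_sided_slopes(lambda l: kinked_loss(l, 10.0), 10.0)
smooth_left, smooth_right = one_sided_slopes(smooth_loss, LAM)
print(kink_left, kink_right)      # slopes disagree: a corner at l = lambda
print(smooth_left, smooth_right)  # both slopes nearly zero: no corner
```

Any other value function with the listed properties would serve equally well; the particular exponential form is used only because its optimum can be written down in closed form.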

The moral of these two possible economic analyses of one example is 
of wide applicability, as is well known among economists. Where a 
superficial analysis suggests a kink, or even a discontinuity, in an in- 
come function, deeper analysis will often show that the function is 
smoothed out by various economic phenomena such as the inhomo- 
geneity and the mutual substitutability of commodities. 

To return from the digression, if L is twice differentiable in l (at
least when l is close to λ), L can be expanded in a Taylor series thus:

(9)    $L(l; i) = L(\lambda; i) + (l - \lambda) \left.\frac{\partial}{\partial l} L(l; i)\right|_{l=\lambda} + \frac{1}{2}(l - \lambda)^2 \left.\frac{\partial^2}{\partial l^2} L(l; i)\right|_{l=\lambda} + o((l - \lambda)^2),$

where, following standard usage, o((l − λ)²) is a function of l and i, not
necessarily the same from one context to another, such that
o((l − λ)²) ÷ (l − λ)² approaches zero as l approaches λ(i) for fixed i.
The first term on the right side of (9) vanishes by the definition of
estimation; the second must vanish also, for otherwise L could be
negative. Therefore,

(10)   $L(l; i) = \frac{1}{2}(l - \lambda)^2 \left.\frac{\partial^2}{\partial l^2} L(l; i)\right|_{l=\lambda} + o((l - \lambda)^2) = (l - \lambda(i))^2 a(i) + o((l - \lambda)^2),$

where a(i) is defined by the context.

In view of (10), it is plausible that L may, in many problems where
estimates of great accuracy are possible, be supposed to be practically
of the form

(11)   $L(l; i) = (l - \lambda(i))^2 a(i),$

where a(i) ≥ 0 for every i. This does not exactly mean that a
reasonable L can be closely approximated by functions of the form (11) for
all l. In particular, the absurd assumption that L is unbounded (which
such approximation would typically imply) is not to be made. It means,
rather, that under favorable circumstances (11) may lead to a
reasonably good evaluation of L(l; i). In so far as the form (11) can be
supposed adequately to represent L, Criterion 2 is obviously an
application of the principle of admissibility. An interesting discussion and
application of (11) is given by Yates [Y2].
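The sense in which (11) approximates a smooth, bounded loss can be illustrated numerically. The sketch assumes a hypothetical loss L(l; λ) = 1 − exp(−(l − λ)²), chosen only because it is bounded, vanishes at l = λ, and has a(λ) = 1; the quadratic form is accurate for small errors and badly wrong for large ones.

```python
import math

def bounded_loss(l, lam):
    """Hypothetical bounded loss 1 - exp(-(l - lam)^2); expm1 keeps small values accurate."""
    return -math.expm1(-((l - lam) ** 2))

def quadratic_coefficient(loss, lam, h=1e-3):
    """a(lam) = half the second derivative of the loss in l at l = lam (central difference)."""
    second = (loss(lam + h, lam) - 2.0 * loss(lam, lam) + loss(lam - h, lam)) / h**2
    return 0.5 * second

lam = 2.0
a = quadratic_coefficient(bounded_loss, lam)
for error in (0.01, 0.1, 1.0, 3.0):
    exact = bounded_loss(lam + error, lam)
    approx = a * error**2
    print(error, exact, approx)
```

For the large error the quadratic form exceeds the bound of the true loss many times over, which is the reason the approximation is meant only for problems in which accurate estimates are possible.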


6 A behavioralistic review, continued 


Thus far, Criteria 1, 2, 3, 5, and 8 have been discussed in behavioral- 
istic terms. In fact, under suitable hypotheses, each has been found to 
have considerable behavioralistic justification. Criteria 4 and 9 also 
have such justification, but my discussion of them is so bulky it had 
better be isolated in a special section. As for Criteria 6 and 7, the only 
ones remaining, they do not seem to me to have any serious justifica- 
tion at all, as will be discussed in still another section. 

Criterion 4, the recommendation of maximum-likelihood estimates, is 
of extraordinary interest, for, of all the criteria of the verbalistic tradi- 
tion, it is essentially the only one that selects a unique estimate in al- 
most every estimation situation of practical importance. The present 
section demonstrates that, in the presence of extensive observation, 
maximum-likelihood estimates are often almost minimax estimates; it 
also gives some analysis of Criterion 9, which refers to efficiency. The 
way to these goals is roundabout; it begins with a study of information 
in the technical sense mentioned in § 3.6. In this section it will be
assumed for mathematical simplicity that each observation under
discussion is confined to a finite number of values, each having positive
probability for every element of whatever partition is under discussion.

If B_i and B_j are elements of a partition, not necessarily finite, and x
is an observation, say, in the spirit of (3.6.11), that the information of
i relative to j for the observation x is

(1)    $J(i, j; x) =_{Df} -E\left( \log \frac{P(x \mid B_j)}{P(x \mid B_i)} \,\Big|\, B_i \right) = -E\left( \log \frac{r_j}{r_i} \,\Big|\, B_i \right).$

The expression of J in terms of likelihood ratios is important, especially
for the extension of the discussion to more general observations than
those contemplated here. The reader should, therefore, try to bear in
mind that the whole discussion could be carried on in terms of
likelihood ratios; I refrain from so doing only for momentary reasons of
notational convenience. The theory of J can conveniently be presented
in a series of exercises.


Exercises 


1a. If y is a contraction of x, then J(i, j; x) ≥ J(i, j; y). With equality
when? Hint:

(2)    $-E\left( \log \frac{P(x \mid B_j)}{P(x \mid B_i)} \,\Big|\, B_i, y \right) \ge -\log \frac{P(y \mid B_j)}{P(y \mid B_i)}.$

1b. J(i, j; x) ≥ 0. With equality when?

2a. If x_1, ..., x_n are conditionally independent, then

(3)    $J(i, j; x_1, \ldots, x_n) = \sum_{r} J(i, j; x_r).$

2b. If in addition the x_r's are conditionally identically distributed,
then

(4)    $J(i, j; x_1, \ldots, x_n) = n J(i, j; x_1).$
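Exercises 1a, 1b, and 2b can be tried out on a small finite example. The sketch below represents P(x | B_i) and P(x | B_j) as dictionaries of hypothetical probabilities, computes J directly from definition (1), and compares the information in three independent copies, and in a contraction of x, with that in x itself. All names are illustrative.

```python
import math
from itertools import product

def information(p_i, p_j):
    """J(i, j; x) = -E(log[P(x|B_j) / P(x|B_i)] | B_i) for finite distributions."""
    return sum(p * math.log(p / p_j[x]) for x, p in p_i.items())

def joint_iid(p, n):
    """Distribution of n conditionally independent, identically distributed copies."""
    return {xs: math.prod(p[x] for x in xs) for xs in product(p, repeat=n)}

def contract(p, f):
    """Distribution of the contraction y = f(x)."""
    out = {}
    for x, px in p.items():
        out[f(x)] = out.get(f(x), 0.0) + px
    return out

# Hypothetical distributions of x given B_i and given B_j.
p_i = {0: 0.5, 1: 0.3, 2: 0.2}
p_j = {0: 0.2, 1: 0.3, 2: 0.5}

def merge(x):
    return min(x, 1)   # contraction that lumps the values 1 and 2 together

J1 = information(p_i, p_j)
J3 = information(joint_iid(p_i, 3), joint_iid(p_j, 3))
Jy = information(contract(p_i, merge), contract(p_j, merge))
print(J1, J3, 3 * J1)  # Exercise 2b: information adds over independent copies
print(Jy, J1)          # Exercise 1a: a contraction cannot increase information
```

The contraction used here is not sufficient for x, so the inequality of Exercise 1a is strict in this instance.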


It is interesting to evaluate the information J(λ, λ + Δλ; x) where λ
and λ + Δλ are two closely neighboring values of the parameter of an
estimation problem, supposed, for simplicity, to be free of nuisance
parameters. If P(x | λ) is continuous in λ, it is almost obvious that
J(λ, λ + Δλ; x) approaches zero as Δλ approaches zero. If P(x | λ) is
differentiable in λ, it is easy to show further (considering that J is
non-negative) that even J(λ, λ + Δλ; x)/Δλ approaches zero as Δλ
approaches zero. But in this case much more can and will be shown,
namely,

(5)    $\lim_{\Delta\lambda \to 0} \frac{J(\lambda, \lambda + \Delta\lambda; x)}{(\Delta\lambda)^2} = \frac{1}{2} H(\lambda; x) =_{Df} \frac{1}{2} E\left( \left( \frac{\partial \log P(x \mid \lambda)}{\partial\lambda} \right)^2 \,\Big|\, \lambda \right).$

The function H is generally, following Fisher, called information, but
here we had better call it differential information. Chronologically, as
explained at the end of § 3.6, the concept of differential information is
older than that here called simply information and of which it is,
according to (5), a limiting case.

The demonstration of (5) begins with the consideration that

(6)    $\log(1 + t) = t - \tfrac{1}{2} t^2 + o(t^2).$

Therefore,

(7)    $\log \frac{P(x \mid \lambda + \Delta\lambda)}{P(x \mid \lambda)} = \log\left\{ 1 + \frac{P(x \mid \lambda + \Delta\lambda) - P(x \mid \lambda)}{P(x \mid \lambda)} \right\}$
       $= \frac{P(x \mid \lambda + \Delta\lambda) - P(x \mid \lambda)}{P(x \mid \lambda)}$
       $\quad - \frac{1}{2} \left\{ \frac{P(x \mid \lambda + \Delta\lambda) - P(x \mid \lambda)}{P(x \mid \lambda)} \right\}^2 + o(\Delta\lambda^2).$

Since the expected value given λ of the term in the second line of
(7) is easily seen to be exactly zero, it will be tactful to leave that term
alone; but the squared term may be approximated thus:

(8)    $\left\{ \frac{P(x \mid \lambda + \Delta\lambda) - P(x \mid \lambda)}{P(x \mid \lambda)} \right\}^2 = \left\{ \frac{1}{P(x \mid \lambda)} \frac{\partial P(x \mid \lambda)}{\partial\lambda} \right\}^2 \Delta\lambda^2 + o(\Delta\lambda^2)$
       $= \Delta\lambda^2 \left\{ \frac{\partial \log P(x \mid \lambda)}{\partial\lambda} \right\}^2 + o(\Delta\lambda^2).$

Therefore,

(9)    $J(\lambda, \lambda + \Delta\lambda; x) = \tfrac{1}{2} H(\lambda; x)\,\Delta\lambda^2 + o(\Delta\lambda^2),$

which establishes (5).
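The limit (5) can be watched numerically in the simplest case. The sketch below assumes a single observation taking the value 1 with probability p and 0 with probability 1 − p, for which the differential information is 1/(p(1 − p)); the parameter value 0.3 and the step sizes are arbitrary illustrative choices.

```python
import math

def bernoulli_information(p, q):
    """J(p, q; x): expectation under p of log[P(x | p) / P(x | q)] for one 0-1 observation."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def differential_information(p):
    """H(p; x) = E((d log P(x | p) / dp)^2 | p) = 1 / (p (1 - p))."""
    return 1.0 / (p * (1.0 - p))

p = 0.3
half_H = 0.5 * differential_information(p)
steps = (1e-1, 1e-2, 1e-3)
ratios = [bernoulli_information(p, p + d) / d**2 for d in steps]
for d, r in zip(steps, ratios):
    print(d, r, half_H)   # the ratio J / (delta^2) approaches H / 2
```

The discrepancy shrinks roughly in proportion to the step, as the o(Δλ²) remainder in (9) would lead one to expect.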


More exercises 


3. If the kth derivative (k > 0) with respect to λ of P(x | λ) exists
for every x, then

(10)   $E\left( \frac{1}{P(x \mid \lambda)} \frac{\partial^k}{\partial\lambda^k} P(x \mid \lambda) \,\Big|\, \lambda \right) = \sum_{x} \frac{\partial^k}{\partial\lambda^k} P(x \mid \lambda) = 0.$

4. If the requisite second derivative exists, then

(11)   $H(\lambda; x) = -E\left( \frac{\partial^2}{\partial\lambda^2} \log P(x \mid \lambda) \,\Big|\, \lambda \right).$

5. If y is a contraction of x (and H(λ; x) is well defined), then
H(λ; y) ≤ H(λ; x).

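Remark: The inequality is obvious in the light of Exercise 1a and the
first part of (5). But it can also be derived from the following
application of Theorem 1 of Appendix 2, which is useful in the next exercise.

(12)   $\left\{ \frac{1}{P(y \mid \lambda)} \frac{\partial P(y \mid \lambda)}{\partial\lambda} \right\}^2 = E^2\left( \frac{1}{P(x \mid \lambda)} \frac{\partial P(x \mid \lambda)}{\partial\lambda} \,\Big|\, y, \lambda \right) \le E\left( \left\{ \frac{1}{P(x \mid \lambda)} \frac{\partial P(x \mid \lambda)}{\partial\lambda} \right\}^2 \,\Big|\, y, \lambda \right),$

with equality for every y and λ, if and only if ∂ log P(x | λ)/∂λ can be
expressed as a function of y and λ alone.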

6a. If y is a contraction of x, H(λ; x) = H(λ; y) for every λ, if and
only if y is sufficient for x.

6b. H(λ; x) = 0 for every λ, if and only if x is utterly irrelevant.

7a. If x_1, ..., x_n are independent given λ, then

(13)   $H(\lambda; x_1, \ldots, x_n) = \sum_{r} H(\lambda; x_r).$

7b. If, in addition, the x_r's are identically distributed given λ, then

(14)   $H(\lambda; x_1, \ldots, x_n) = n H(\lambda; x_1).$


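8. If l is a real-valued contraction of x, and H(λ; x) is well defined,
then

(a)
(15)   $\frac{\partial}{\partial\lambda} E(l \mid \lambda) = E\left( (l - \lambda) \frac{\partial \log P(x \mid \lambda)}{\partial\lambda} \,\Big|\, \lambda \right).$

(b)
(16)   $E((l - \lambda)^2 \mid \lambda)\, H(\lambda; x) \ge \left\{ \frac{\partial}{\partial\lambda} E(l \mid \lambda) \right\}^2,$

with equality if and only if

(17)   $\frac{\partial}{\partial\lambda} \log P(x \mid \lambda) = (l - \lambda) k$

for some constant k. Hint: Use Exercise 3 and apply the Schwarz
inequality to (15).

(c) If H(λ; x) > 0, then

(18)   $E((l - \lambda)^2 \mid \lambda) \ge \left\{ \frac{\partial}{\partial\lambda} E(l \mid \lambda) \right\}^2 \Big/ H(\lambda; x).$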


Exercise 8c is an important, and now famous, inequality. It, together
with its n-dimensional generalization, has been called the Cramér-Rao
inequality because of its independent publication by Rao and Cramér
in 1945 and 1946 respectively (see [H6]). But the name is not at all
well justified historically. Fréchet presented the inequality in 1943
[F8], and Darmois extended Fréchet's inequality to n dimensions, at
least for unbiased estimates, in a publication [D1] not later than Rao's.
The inequality has also, though I think erroneously, been attributed to
an early paper by Aitken and Silverstone [A1] and to one by Doob
[D10]. My point is, of course, not to give a definitive history of the
inequality, but merely to suggest that for the time being an impersonal
name would be better. I tentatively propose calling it the information
inequality. Some recent references pertinent to the information
inequality and other topics treated thus far in this section are [W15],
[M5], [C6], and [H6]. The techniques used in the remainder of this
section, which revolve around the information inequality, were
published posthumously by Wald [W5].
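The inequality of Exercise 8c can be checked exactly for a binomial observation. The sketch below is an illustration, not part of the original exercises: it takes y to be the number of successes in n trials, computes the slope of E(l | p) by means of the identity (15), and compares the two sides of (16) for the maximum-likelihood estimate y/n and for a hypothetical biased estimate (y + 1)/(n + 2).

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """P(y | p) for y = 0, ..., n successes in n trials."""
    return [comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(n + 1)]

def inequality_terms(n, p, estimate):
    """Return (E((l - p)^2 | p), H(p; y), dE(l | p)/dp), the last computed via (15)."""
    pmf = binomial_pmf(n, p)
    score = [(y - n * p) / (p * (1 - p)) for y in range(n + 1)]  # d log P(y | p) / dp
    mse = sum(w * (estimate(y) - p) ** 2 for y, w in enumerate(pmf))
    info = sum(w * s * s for w, s in zip(pmf, score))
    slope = sum(w * (estimate(y) - p) * s for y, (w, s) in enumerate(zip(pmf, score)))
    return mse, info, slope

n = 8

def mle(y):
    return Fraction(y, n)

def shrunk(y):
    return Fraction(y + 1, n + 2)   # hypothetical biased estimate

for k in range(1, 10):
    p = Fraction(k, 10)
    for name, est in (("y/n", mle), ("(y+1)/(n+2)", shrunk)):
        mse, info, slope = inequality_terms(n, p, est)
        print(name, p, mse, slope**2 / info)   # left side never below right side
```

For y/n the two sides agree exactly at every p, which corresponds to the case of equality described by (17).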

The information inequality has an important bearing on application of
the minimax rule to estimation, of which the following theorem may,
in view of (5.11), be taken as a first illustration.


THEOREM 1

Hyp. 1. For every λ in a closed interval of length δ, H(λ; x) ≤ H,
where H is a constant.
2. l is a real-valued contraction of x.

CONCL. For some λ in the interval,
$E((l - \lambda)^2 \mid \lambda) \ge (H^{1/2} + 2/\delta)^{-2}.$

PROOF. Suppose that the theorem is false. Then according to
Exercise 8c,

(19)   $1 > H^{1/2}(H^{1/2} + 2/\delta)^{-1} > \frac{\partial}{\partial\lambda} E(l \mid \lambda)$

for every λ in the interval. Therefore,

(20)   $\frac{\partial}{\partial\lambda} [\lambda - E(l \mid \lambda)] > 1 - H^{1/2}(H^{1/2} + 2/\delta)^{-1} = \frac{2}{\delta H^{1/2} + 2}$

for every λ in the interval. Therefore, at one end of the interval or
the other,

(21)   $|\lambda - E(l \mid \lambda)| > \frac{2}{\delta H^{1/2} + 2} \cdot \frac{\delta}{2} = (H^{1/2} + 2/\delta)^{-1}.$

This leads to a contradiction through the well-known inequality

(22)   $E((l - \lambda)^2 \mid \lambda) \ge \{E(l - \lambda \mid \lambda)\}^2 = [\lambda - E(l \mid \lambda)]^2,$

which can be derived as a direct application of Theorem 1 of Appendix
2, or of the Schwarz inequality, or of the useful identity

(23)   $E((l - \lambda)^2 \mid \lambda) = V(l \mid \lambda) + \{E(l - \lambda \mid \lambda)\}^2.$ ◆
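Theorem 1 can be illustrated with a binomial observation. The sketch below is an assumption-laden example, not part of the text: y is the number of successes in n = 10 trials, the interval is 0.4 ≤ p ≤ 0.6, and H is taken as the largest value of n/(p(1 − p)) on that interval. Three hypothetical estimates are compared, including a constant one, for which the information inequality alone gives only the trivial bound zero at each p.

```python
from math import comb, sqrt

def mse(n, p, estimate):
    """E((l - p)^2 | p) when y is binomial with n trials and probability p."""
    return sum(comb(n, y) * p**y * (1 - p) ** (n - y) * (estimate(y) - p) ** 2
               for y in range(n + 1))

n, lo, hi = 10, 0.4, 0.6
delta = hi - lo
H = n / (lo * (1 - lo))                 # bound on H(p; y) = n / (p (1 - p)) over [lo, hi]
bound = (sqrt(H) + 2 / delta) ** -2     # right side of the conclusion of Theorem 1

grid = [lo + delta * k / 20 for k in range(21)]
estimates = {
    "constant 1/2": lambda y: 0.5,
    "y/n": lambda y: y / n,
    "(y+1)/(n+2)": lambda y: (y + 1) / (n + 2),
}
worst = {name: max(mse(n, p, f) for p in grid) for name, f in estimates.items()}
print(bound)
print(worst)   # each estimate has expected squared error at least `bound` somewhere
```

The bound is far from sharp in this small example; its value lies in holding for every estimate whatever, biased or not.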
In the remaining portion of this section, let it be understood that:

1. The x_r's are an infinite sequence of observations that are, given λ,
identically distributed and independent.

2. x(n) = {x_1, ..., x_n} for n = 1, 2, ....

3. l(n) is a real-valued contraction of x(n).

The contraction l(n) is to be thought of as an estimate of λ based on
observation of x(n). In the spirit of the minimax theory it is really
mixed, rather than ordinary, estimates that should be treated here.
But this entails no essential change in the following discussion once it
is recognized that a mixed estimate is, in effect, an ordinary estimate
based on observation of y(n) =_Df {l(n), x(n)}, where x(n) is sufficient
for y(n), so that H(λ; y(n)) = H(λ; x(n)) for all λ.

4. ε and δ are positive numbers.

5. Λ_0 is a closed interval of length δ contained in the range of λ and
including a given value λ_0.


The next theorem shows that, if L(l; λ) is of the form (5.11),
L(l(n); λ) cannot ordinarily be kept much smaller than
a(λ_0)/nH(λ_0; x_1) for large n, even in a small interval about λ_0.


THEOREM 2 If H(λ; x_1) is continuous and positive at λ_0, and if
a(λ) is a non-negative function continuous at λ_0, then, for sufficiently
large n, E((l(n) − λ)² a(λ) | λ) ≥ (1 − ε)a(λ_0)/nH(λ_0; x_1) for some
λ ∈ Λ_0.


PROOF. There is no loss of generality in supposing that ε < 1 and
Λ_0 such that, for λ ∈ Λ_0, a(λ) ≥ a(λ_0)(1 − ε)^{1/2} and
H(λ; x_1)^{1/2} ≤ H(λ_0; x_1)^{1/2}[1 + (1 − ε)^{-1/4}]/2. Using Exercise 7b,

(24)   $H(\lambda; x(n))^{1/2} = n^{1/2} H(\lambda; x_1)^{1/2} \le \frac{n^{1/2}}{2} H(\lambda_0; x_1)^{1/2} [1 + (1 - \epsilon)^{-1/4}]$

for λ ∈ Λ_0. By Theorem 1, if n ≥ 16/{δ² H(λ_0; x_1)[(1 − ε)^{-1/4} − 1]²},
then

(25)   $E((l(n) - \lambda)^2 \mid \lambda) \ge \left\{ \frac{n^{1/2}}{2} H(\lambda_0; x_1)^{1/2} [1 + (1 - \epsilon)^{-1/4}] + \frac{2}{\delta} \right\}^{-2} \ge \frac{(1 - \epsilon)^{1/2}}{n H(\lambda_0; x_1)}$

for some λ ∈ Λ_0. ◆


The next theorem extends Theorem 2 to practically any loss function
that is twice differentiable in l for l and λ close to λ_0.


THEOREM 3

Hyp. 1. H(λ; x_1) is positive and continuous at λ_0.
2. $a(\lambda) =_{Df} \frac{1}{2} \left.\frac{\partial^2}{\partial l^2} L(l; \lambda)\right|_{l=\lambda}$ is continuous at λ_0.
3. Inequality (5.1) holds for λ in Λ_0.

CONCL. For sufficiently large n, L(l(n); λ) ≥ (1 − ε)a(λ_0)/nH(λ_0; x_1)
for some λ ∈ Λ_0.

PROOF. It may be supposed without loss of generality that ε < 1;
and that, for l, λ ∈ Λ_0, L(l; λ) ≥ (1 − ε)^{1/2} a(λ)(l − λ)².

It may also be supposed that l(x; n) ∈ Λ_0. This is so, because it would
suffice to prove the theorem for a new estimate l′(n), where l′(x; n) is
defined to be the number in Λ_0 closest to l(x; n), which in turn follows
from the fact that L(l′(n); λ) ≤ L(l(n); λ) for λ ∈ Λ_0.

These suppositions having been made, the theorem is a direct
consequence of Theorem 2. ◆


COROLLARY 1 If L(l; λ) satisfies (5.1) and has two derivatives with
respect to l continuous in λ for every λ and for every l sufficiently close
to λ, and if H(λ; x_1) is continuous and positive, then, for sufficiently
large n,

(26)   $L^*(n) \ge (1 - \epsilon) \sup_{\lambda} \frac{a(\lambda)}{n H(\lambda; x_1)},$

where L*(n) is the minimax value of the estimation decision problem
derived from L(l; λ) and x(n), unless the supremum in question is
infinite, in which case nL*(n) approaches infinity.


Of course, it would be enough to assume only that L(l; λ) and H(λ; x_1)
are well behaved at some sequence of values of λ on which the supremum
in question is approached. In particular, if the supremum is actually
attained at some λ, they need only be well behaved there.

Now, turning to the sequence of maximum-likelihood estimates, let
them be denoted for the moment by l̂(n). It is known that under
rather general hypotheses n^{1/2}(l̂(n) − λ) is asymptotically normal about
zero with asymptotic variance 1/H(λ; x₁). This suggests, and
examples tend to confirm, that, under some supplementary conditions,

$$\lim_{n \to \infty} nE\big((\hat{l}(n) - \lambda)^2 \mid \lambda\big) = \frac{1}{H(\lambda; x_1)}. \tag{27}$$

Indeed, one set of conditions implying (27) is stated in [W5], but one
that seems difficult to apply. It can be shown that (27), together with
the usual asymptotic behavior of l̂(n), implies

$$\lim_{n \to \infty} nL(\hat{l}(n); \lambda) = \frac{a(\lambda)}{H(\lambda; x_1)}, \tag{28}$$

provided, for example, that L(l; λ) is bounded for each λ and that the
second derivative of L(l; λ) with respect to l exists when l = λ. Easily
applied rigorous theorems implying (28) much less (27) do not seem to
have been formulated yet; but examples suggest that, under conditions
general enough for many applications, (28) actually does hold
uniformly, in the sense that, for n sufficiently large,

$$(1 - \varepsilon)\frac{a(\lambda)}{nH(\lambda; x_1)} \le L(\hat{l}(n); \lambda) \le (1 + \varepsilon)\frac{a(\lambda)}{nH(\lambda; x_1)} \tag{29}$$

for all λ simultaneously. If (29) holds, then, in view of Corollary 1,
l̂(n) is nearly minimax for large n, in the sense that

$$L^*(n) \ge (1 - \varepsilon) \sup_{\lambda} L(\hat{l}(n); \lambda). \tag{30}$$


Good examples can be based on (a) of Tables 3.1 and 4.1, letting
L(l; p) be any loss function having two continuous derivatives in l
throughout 0 ≤ l, p ≤ 1. In particular, the example discussed in
§ 13.4 arises, if L(l; p) = (l − p)². It can be argued that the
phenomenon discussed in connection with that example is probably not rare;
because, for minimax l(n), L(l(n); λ) is, judging from examples, often
constant and, therefore, nearly equal to sup_λ a(λ)/nH(λ; x₁), but
L(l̂(n); λ) closely follows the rise and fall of a(λ)/nH(λ; x₁).


† Some key references for the asymptotic behavior of l̂(n) are [K2], [C9], [L3],
[W16], [N4]. The literature on this subject is extraordinarily complicated. There
are acknowledged mathematical mistakes in some of its most sophisticated
publications; others prove much less than any but the most attentive reader would be led
to suppose; few give an adequate statement of their relations to their predecessors;
and those that make serious pretensions to rigor involve complicated hypotheses.
For documentation of this lament see [N4], [W4], and [L3].
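
The contrast just described can be checked numerically in the binomial case with squared-error loss, where a(p) = 1 and 1/H(p; x₁) = p(1 − p), so that a(p)/nH(p; x₁) = p(1 − p)/n. The sketch below is only an illustration: it assumes that the minimax estimate meant in § 13.4 is the familiar (x + √n/2)/(n + √n), and the sample size n = 100 and all function names are choices made here, not taken from the text.

```python
from math import comb, sqrt

def risk(estimate, n, p):
    """Exact expected squared error of estimate(x, n) when x is binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) * (estimate(x, n) - p)**2
               for x in range(n + 1))

def mle(x, n):
    # maximum-likelihood estimate of p
    return x / n

def minimax(x, n):
    # ASSUMED form of the minimax estimate for squared-error loss
    return (x + sqrt(n) / 2) / (n + sqrt(n))

n = 100  # illustrative sample size
for p in (0.1, 0.3, 0.5):
    bound = p * (1 - p) / n  # a(p)/nH(p; x1) for this loss
    print(p, bound, risk(mle, n, p), risk(minimax, n, p))
```

Under these assumptions the maximum-likelihood risk equals p(1 − p)/n at every p, whereas the risk of the assumed minimax estimate has the same value, 1/[4(1 + √n)²], at every p.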

Turn now to Criterion 9, efficiency. It seems difficult to defend the
criterion as it has been defined in connection with (4.8); for what
virtue is there in the asymptotic normality required by (4.8)? It is
perhaps noteworthy that the sequence of minimax estimates, p₁(n),
arising in connection with § 13.4 does not satisfy (4.8). Indeed, (13.4.3)
implies that n^{1/2}(p₁(n) − p) is asymptotically normal not about zero,
but about (½ − p).

It is my impression that the essence of the efficiency concept resides
not in asymptotic normality, but in the overall behavior of the mean
square error of a sequence of estimates. I therefore propose tentatively
to modify the definition and to call a sequence of estimates l(n)
efficient, if and only if its mean square error behaves at least as well as
can typically be expected for a sequence of maximum-likelihood
estimates.

Formally, I propose to call l(n) efficient, if and only if, for n
sufficiently large,

$$E\big((l(n) - \lambda)^2 \mid \lambda\big) \le \frac{1 + \varepsilon}{nH(\lambda; x_1)} \tag{31}$$

for every λ simultaneously.

I think the main objection that is likely to be raised to this proposed
definition is associated with the possibility that in some problems of
theoretical, and perhaps also of practical, importance (31) is not
satisfied by any sequence of estimates whatsoever, though the
maximum-likelihood sequence is efficient in the “official” sense. In such a
problem, are the maximum-likelihood estimates not as good for all practical
purposes for sufficiently large n as though their variances were actually
equal to those of the normal distributions to which they approximate?
It is natural to think so by analogy with other contexts in the theory
of probability, but approximate normality is actually no substitute for
(31) in the present context. The next paragraph is devoted to an
example illustrating the inadequacy of asymptotic variance as a measure
of asymptotic loss. It can be skipped without loss by anyone not
interested in such technicalities.

The best example I have been able to construct is derived from a
sequence of observations that is not a standard sequence. Whether the
interesting features that it exhibits can actually be realized by standard
sequences, I do not know; but the example will do to illustrate the
issue. Let y(n) be any real random variable subject to the density
n^{1/2}φ((y − λ)n^{1/2}; n), defined thus: φ(z; n) is the standard normal density
inside the interval [−δ(n), δ(n)], δ(n) being such that the standard
normal probability of this interval is (1 − n⁻¹); φ(z; n) = z⁻²δ(2n)/4
for δ(2n) ≤ |z| ≤ n^{1/2}; φ(z; n) is so defined elsewhere as to be a
symmetric positive probability density with the first two moments finite,
with a bounded derivative approaching zero like z~* with increasing z, 
and with unique absolute maximum at z = 0. It is evident that
n^{1/2}(y(n) − λ) is asymptotically normal about zero with unit variance.
The information H(λ; y(n)) is well defined (even according to the strict
conditions imposed by Cramér, Lemma 1, Section 32.2 of [C9]). The
maximum-likelihood estimates of λ are y(n), and these are also
(according to Theorem 3.3 of [G1]) minimax for the simple quadratic loss
function (l − λ)². But


$$\begin{aligned} E\big([y(n) - \lambda]^2 \mid \lambda\big) &= E\big(y(n)^2 \mid 0\big) \\ &\ge 2n^{1/2} \int_{\delta(2n)n^{-1/2}}^{1} y^2 \phi(yn^{1/2}; n)\, dy \\ &= \tfrac{1}{2} n^{-1/2}\big[1 - \delta(2n)n^{-1/2}\big]\,\delta(2n), \end{aligned} \tag{32}$$

which does not satisfy (31). Even for the bounded, and therefore more
realistic, loss function,

$$L(l; \lambda) = \min\{1, [l - \lambda]^2\}, \tag{33}$$

it follows easily from Theorem 3.3 of [G1] that every estimate must
somewhere incur a loss at least as great as the lower bound established
by (32). To summarize, there are no estimates efficient in the sense
of (31), nor even in the sense that would arise from (31) on replacing
the simple quadratic loss function by a bounded loss function; the
sequence of estimates y(n) is efficient in the official sense, so to speak,
but does not, of course, result in losses of the order of n⁻¹.

What can be said in positive justification of the criterion of efficiency
as defined by (31) or the like? Roughly, the elements of such a
sequence nearly dominate every estimate for every smooth loss function.
A little more precisely, for large n, the loss associated with an element
of a sequence efficient in the sense of (31) is at most larger by a small
fraction than that of any other estimate, except possibly in some short
intervals.† The maximum loss of such an element is at most larger by
a small fraction than the minimax loss, so the elements of the sequence
are typically nearly minimax. Moreover, they typically have
considerably smaller losses than any minimax estimate, except in short
intervals that are typically very improbable a priori in the personal sense.
Thus the principle of admissibility, the minimax rule, and the
personalistic concept of probability combine to suggest that efficiency as
defined by (31) is a promising guide in the search for good estimates.


† It has actually been demonstrated that the total length of these exceptional
intervals (within any fixed interval) is small [L3].

An extensive critique of the concept of efficiency, including much 
material on its history, has been given by LeCam in [L3], which unfor- 
tunately was not available to me in its entirety as I wrote this section. 

R. A. Fisher’s name is the most prominent in the history of maximum- 
likelihood estimation and efficiency. Some historical details are given 
in [N4] and on p. 45 of Vol. II of [K2]. 


7 A behavioralistic review, concluded 


Criteria 6 (unbiasedness) and 7 are now the only ones in the list for 
which I have not suggested some justification in terms of the theory of 
decision problems, and, indeed, I cannot. Unbiased estimates fascinate 
many theoretical statisticians, including myself, and the study of them 
undoubtedly has certain valuable by-products. Yet it is now widely 
agreed that a serious reason to prefer unbiased estimates seems never 
to have been proposed. 

Three weak defenses are sometimes heard. First, unbiasedness is as- 
serted to have an intuitive appeal; whether it does or not depends, of 
course, on the experience of the intuiter. Second, averages of increas- 
ingly many unbiased estimates are typically consistent. If this is a 
virtue, it is a limited one and pertains to the unbiased estimate not as 
an estimate, but as a step in the definition of other estimates. Third, 
an allusion is made to equity. If, for example, it has been agreed that 
one party will buy a sack of sugar from another at so much per pound, 
it seems fair that the nominal weight of the sack be determined by un- 
biased estimate. This ethical conclusion could perhaps be given some 
justification in terms of approximately linear utility functions or a long- 
run argument, though there is danger of falling into such pitfalls as the 
conclusion that accuracy is unimportant for equity; and it might find 
some application in the theory of barter; but it seems, at best, tangen- 
tial to estimation in the sense of the present chapter. 

For a proper appraisal of the criterion of unbiasedness it should be
realized that, even if λ admits an unbiased estimate, many not-at-all
pathological functions of λ (which can in turn be regarded as
parameters), may fail to do so and that such unbiased estimates as λ does admit
may be preposterous. These phenomena are both illustrated by the
following simple example. Let x be confined to two values, say 1 and
2; let P(1 | λ) = 1 − P(2 | λ) = λ; and let λ be confined to the interval
[1/3, 2/3]. Then, by definition, l is an unbiased estimate of φ(λ), if
and only if l(1)λ + l(2)(1 − λ) = l(2) + (l(1) − l(2))λ = φ(λ), a
condition that can be met, if and only if φ is linear. Suppose, for example,
φ(λ) = λ for every λ, then l(1) = 1, l(2) = 0 defines the only unbiased
estimate of φ(λ). This estimate is worse, according to an emphatic
variant of Criterion 3, than the biased estimate l′ such that l′(1) = 2/3
and l′(2) = 1/3; for l′ (when it errs at all) errs in the same direction as
l, but never nearly as far.
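
The two-point example can be verified by direct arithmetic. The following sketch uses exact fractions; the grid of parameter values and the variable names are illustrative choices made here, not part of the text.

```python
from fractions import Fraction as F

def expected(l, lam):
    """E(l | lam) when P(1 | lam) = lam and P(2 | lam) = 1 - lam."""
    return l[1] * lam + l[2] * (1 - lam)

l_unbiased = {1: F(1), 2: F(0)}        # l(1) = 1, l(2) = 0
l_biased = {1: F(2, 3), 2: F(1, 3)}    # l'(1) = 2/3, l'(2) = 1/3

# an illustrative grid of parameter values covering [1/3, 2/3]
grid = [F(1, 3) + F(k, 300) for k in range(101)]

# l is unbiased for lam at every grid point
is_unbiased = all(expected(l_unbiased, lam) == lam for lam in grid)

# largest possible error of each estimate over the grid and both outcomes
worst_unbiased = max(abs(l_unbiased[x] - lam) for lam in grid for x in (1, 2))
worst_biased = max(abs(l_biased[x] - lam) for lam in grid for x in (1, 2))
print(is_unbiased, worst_unbiased, worst_biased)
```

On this grid the error of l′ is smaller than that of l by exactly 1/3 whatever is observed, which is the sense in which l′ never errs nearly as far.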

As for Criterion 7, it is on first encounter appealing to postulate that,
if l is usually closer to λ than l′ is, then l is better than l′. But, speaking
at least for myself, the initial appeal of Criterion 7 seems to have been
bound up with the conjecture that Criterion 7 is in some sense of the
same sort as Criterion 3. The example given under Criterion 7 almost
entirely evaporates the conjecture, and with it the appeal.

In the paper [P5] in which the criterion is put forward for
consideration and exploration, Pitman mentions that the criterion seems
acceptable in contexts where “the devil takes the hindmost.” This
allusion to the devil seems to offer no justification for the criterion as a
criterion of estimation, for I understand the allusion to refer only to the
following kind of decision problem, which is quite remote from
estimation as ordinarily understood and is hardly ever encountered: A person
must choose between l and l′, winning a prize if the estimate of his
choice falls closer to λ than does the other one.

According to Pitman, the relationship of “better than,” or “closer
than” as he calls it, defined by Criterion 7, is not necessarily transitive.
He argues, I think with some justice, that this breakdown of transitivity
does not in itself invalidate the criterion when the criterion is applied
to select the “best” from some prescribed class of estimates; but “best”
cannot here be taken literally.
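
The failure of transitivity is easy to exhibit in a toy case. The sketch below is a constructed illustration, not Pitman's own example: three hypothetical estimates A, B, and C are given absolute errors on three equally likely outcomes, and “closer than” is read as “closer with probability greater than 1/2.”

```python
from fractions import Fraction as F

# Absolute errors |estimate - lambda| of three hypothetical estimates
# on three equally likely outcomes (values invented for illustration).
errors = {"A": (1, 2, 3), "B": (2, 3, 1), "C": (3, 1, 2)}

def prob_closer(u, v):
    """Probability that estimate u falls strictly closer to lambda than v."""
    wins = sum(1 for eu, ev in zip(errors[u], errors[v]) if eu < ev)
    return F(wins, 3)

for u, v in (("A", "B"), ("B", "C"), ("C", "A")):
    print(u, "closer than", v, "with probability", prob_closer(u, v))
```

Each estimate in the cycle is closer than the next with probability 2/3, so no one of the three can be “best” in a literal sense.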

Criterion 7 is unusual in that it depends on the joint conditional
distributions of pairs of estimates rather than on the distributions of each
estimate considered separately. On any ordinary interpretation of
estimation known to me, it can be argued (as it was under Criterion 3)
that no criterion need depend on more than the separate distributions.


CHAPTER 16 


Testing 


1 Introduction 


In principle, this chapter on the statistical process of testing (often 
referred to more fully as making tests of hypotheses or significance 
tests) might have been organized on the pattern of the preceding chap- 
ter on point estimation: a statement of verbalistic ideas, followed by 
motivation and criticism in terms of behavioralistic ideas. But I am 
dissuaded from repeating that pattern by several considerations. It 
would, in the first place, be needlessly repetitious. Thus, in the pres- 
ence of the preceding chapter I need mention only in passing that suffi- 
cient statistics and symmetry play the same role in testing as in other 
observational decision problems, and that a certain scheme of testing, 
closely related to maximum-likelihood estimation, has asymptotic, or 
large sample, virtues. Again, the pattern of the preceding chapter is 
less attractive here, because the criteria for tests developed in the ver- 
balistic tradition do not on the whole seem to have such satisfying be- 
havioralistic motivation as do their counterparts in the theory of point 
estimation. Finally, it is inappropriate to attempt anything like a
complete list of verbalistic criteria for tests here, especially in view of 
the availability of two excellent and mutually complementary key ref- 
erences (Chapters 21, 26, and 27 of [K2]; and [L4]). 

The organization actually adopted is this: First, testing and criteria 
for tests are discussed from a frankly behavioralistic viewpoint. In 
this discussion ideas stemming from the verbalistic tradition are used 
freely, and some criteria of the verbalistic tradition are criticized. Sec- 
ond, an attempt is made to analyze some of the important statistical 
situations to which the theory of testing is ordinarily applied. It is 
becoming increasingly recognized that many of these applications are 
very crude, and that their replacement by sounder procedures consti- 
tutes some of the most important and provocative statistical problems 
of today. 

Terms introduced in boldface in this chapter are among the most
frequent in ordinary statistical usage. The definitions given are
intended to be in reasonable accord with that usage, but some small
concessions are made to the particular form in which the theory of testing
is expressed here.


2 A theory of testing 


Verbalistically, the problem of testing means to guess, on the basis
of observation, which of two disjoint and mutually exhaustive
hypotheses obtains. Behavioralistically, this would generally be agreed to
point to the definition: A testing problem is an observational decision
problem derived from exactly two basic acts f₀ and f₁. These two basic
acts are called (for a reason that will soon be clear) accepting and
rejecting the null hypothesis, respectively.

Considered abstractly as bilinear games, testing problems may, so
far as I know, have no special feature beyond the uninteresting one
that one of two f's is appropriate to each i. But, considered as
observational problems, testing problems do present some interesting special
features. In the first place, since at least one of the two basic acts is
appropriate to each i, the set I of all i's can be partitioned into three
sets, H₀, H₁, and N, defined thus:

$$\begin{array}{lll} L(f_0; i) = 0 & \text{and } L(f_1; i) > 0 & \text{for } i \in H_0, \\ L(f_0; i) > 0 & \text{and } L(f_1; i) = 0 & \text{for } i \in H_1, \\ L(f_0; i) = 0 & \text{and } L(f_1; i) = 0 & \text{for } i \in N. \end{array} \tag{1}$$


When it is recalled that the i's correspond to a partition Bᵢ of S, the
sets H₀, H₁, and N may, with a slight clash of logical gears, be regarded
as three events partitioning S. The traditional names of H₀ and H₁
are the null and the alternative hypothesis, respectively; N, being quite
unimportant and often either ignored or made vacuous by some trick
of definition, has no such name. Rejecting the null hypothesis when it
does in fact obtain and accepting it when it does not obtain are called
errors, more specifically errors of the first and second kind,
respectively.

A test is a derived act of a testing problem. A test may conveniently
be identified with the real-valued contraction z of the observation x,
such that z(x) is the probability prescribed by the test for rejection of
the null hypothesis in case x is observed. An unmixed test (which was
until recently the only kind contemplated) corresponds to a z confined
to the two values 0 and 1, which respectively imply outright acceptance
and rejection of the null hypothesis.

The loss associated with the test z when i obtains is clearly

$$\begin{aligned} L(z; i) &= L(f_0; i)E(1 - z \mid i) + L(f_1; i)E(z \mid i) \\ &= L(f_1; i)E(z \mid i) && \text{for } i \in H_0 \\ &= L(f_0; i)[1 - E(z \mid i)] && \text{for } i \in H_1 \\ &= 0 && \text{for } i \in N. \end{aligned} \tag{2}$$

The functions E(z | i) and [1 − E(z | i)] are, respectively, the
probability of rejecting and accepting the null hypothesis with the test z
when i obtains. There is obviously nothing to choose between them
in importance or convenience, each being equivalent to the other.
They are commonly called the power function, and operating
characteristic, respectively.

In view of (2), one test z dominates another z′, if and only if

$$\begin{aligned} E(z \mid i) &\le E(z' \mid i) && \text{for } i \in H_0 \\ E(z \mid i) &\ge E(z' \mid i) && \text{for } i \in H_1; \end{aligned} \tag{3}$$

or, again, if and only if the probability of error with z′ is at least as
great as with z for every i. Thus, dominance, admissibility, and
equivalence depend on the basic loss function, L(f_r; i), only in so far as that
function determines H₀ and H₁. This is not only remarkable but also
useful; for H₀ and H₁ may well be clearly defined in contexts where
the basic loss is vague, or otherwise ill determined.
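
Condition (3) involves only the power functions and the sets H₀ and H₁, so it can be checked without any loss function at all. The sketch below illustrates this on a small invented problem; the probabilities, the two tests, and the function names are hypothetical and serve only to show the comparison.

```python
from fractions import Fraction as F

def power(z, P, i):
    """E(z | i): probability of rejecting the null hypothesis when i obtains."""
    return sum(P[i][x] * z[x] for x in P[i])

def dominates(z, z_prime, P, H0, H1):
    """Condition (3): z errs no more often than z_prime at every i."""
    return (all(power(z, P, i) <= power(z_prime, P, i) for i in H0) and
            all(power(z, P, i) >= power(z_prime, P, i) for i in H1))

# Hypothetical problem: two states i, two observations x; P[i][x] = P(x | i).
P = {0: {0: F(3, 4), 1: F(1, 4)},
     1: {0: F(1, 4), 1: F(3, 4)}}
H0, H1 = (0,), (1,)

z_obs = {0: F(0), 1: F(1)}           # reject exactly when x = 1
z_coin = {0: F(1, 2), 1: F(1, 2)}    # ignore x and reject with probability 1/2

print(dominates(z_obs, z_coin, P, H0, H1))
print(dominates(z_coin, z_obs, P, H0, H1))
```

No value of L(f₀; i) or L(f₁; i) enters the comparison, in keeping with the remark that dominance depends on the basic loss only through H₀ and H₁.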

If z is admissible in the spirit of (3) relative to a pair of sets H₀ and
H₁, then (if ∞ is for the moment admitted as a possible value for a loss)
there exists a basic loss function leading to H₀ and H₁ and having z
as its essentially unique minimax. Indeed, let

$$\begin{aligned} L(f_0; i) &= \{1 - E(z \mid i)\}^{-1} && \text{for } i \in H_1 \\ &= 0 && \text{elsewhere;} \\ L(f_1; i) &= E(z \mid i)^{-1} && \text{for } i \in H_0 \\ &= 0 && \text{elsewhere.} \end{aligned} \tag{4}$$

With this loss and reckoning 0·∞ = 0 (as is appropriate here), L(z; i)
= 1 or 0, according as there is or is not positive probability of making
an error at i with z. In view of (2) and (4), any minimax z′ not
equivalent to z would strictly dominate z, contrary to the assumption that z
is admissible. The moral of that conclusion can be put thus: Without
special assumptions about the basic loss, the principle of admissibility
and the minimax rule lead to no criteria expressible solely in terms of
H₀, H₁, and the conditional distributions of the observation x other
than that of admissibility itself. Whether some other objectivistic
principle could justify such criteria may be considered an open question,
but, as I have already said (in § 15.1), no other general objectivistic
principles have been seriously maintained.

It is natural, for example, to demand that z have the same symmetry
as P(x | i) and H₀ and H₁; but that criterion can surely not be justified
at all, unless the basic loss is also assumed to have the same symmetry,
the justifiability of which in turn depends on the case.

To take another important example, it is often proposed that a
satisfactory test must be unbiased,‡ that is, its power function must never
be higher in H₀ than in H₁. More formally, the test z is unbiased, if
and only if

$$E(z \mid i_0) \le E(z \mid i_1) \tag{5}$$

for every i₀ ∈ H₀ and every i₁ ∈ H₁.

Assuming that L(f₀; i) and L(f₁; i) are constant in H₁ and H₀,
respectively, it will be shown that any minimax must be unbiased. As a
step toward that demonstration, consider a testing problem as a
minimax problem, without any special assumption about the basic loss.
It is possible that L* = 0, in which case the minimax tests are all
equivalent and all unbiased. Putting that possibility aside, I assert, and will
show, that (under the usual mathematical simplifications)

$$\max_{i \in H_0} L(z; i) = \max_{i \in H_1} L(z; i) = L^* \tag{6}$$

for any minimax z. It is obvious that neither maximum exceeds L*,
and also that one or the other must equal L*. But suppose, for
example, that the second maximum were actually less than L*, and consider
z′ = αz with 0 < α < 1. According to (2), if z′ is substituted for z,
the first maximum in (6) will be depressed, and, for α sufficiently close
to 1, the second would remain actually less than L*, which contradicts
the assumption that z is minimax, establishing (6).
Now make the special assumption that

$$\begin{aligned} L(f_0; i) &= A && \text{for } i \in H_1, \\ L(f_1; i) &= B && \text{for } i \in H_0, \end{aligned} \tag{7}$$

and suppose that z could be minimax but biased. There would then
exist i₀ ∈ H₀ and i₁ ∈ H₁ such that

$$L^* = L(z; i_0) = BE(z \mid i_0) = A - AE(z \mid i_1) = L(z; i_1), \tag{8}$$

and such that E(z | i₀) > E(z | i₁). But consideration of the test that
simply assigns to every x the number β midway between E(z | i₀) and
E(z | i₁) shows that z could not be minimax.


‡ A definition unifying the various concepts of unbiasedness in statistics is put
forward in [L5].

The condition (7) is a reasonable assumption in some testing problems, 
and, where (7) is satisfied, the criterion of unbiasedness has such sup- 
port as the minimax rule can give. In many other typical testing prob- 
lems, however, there are borderline errors that hardly matter at all but 
can scarcely be prevented, and serious errors that can largely be pre- 
vented. The following example, which can be varied to suit diverse 
tastes, shows that it can be folly to insist on unbiasedness in such 
problems. 

Let i take the three values 0, 1, 2, and let x take the values 0 and 1
with conditional probabilities defined thus:

$$P(0 \mid 0) = 99/100, \quad P(0 \mid 1) = 0, \quad P(0 \mid 2) = 1. \tag{9}$$

Let the basic loss be defined by the condition that i ∈ H₀ or i ∈ H₁,
according as i = 0 or not, and by

$$L(f_1; 0) = 1, \quad L(f_0; 1) = 1, \quad L(f_0; 2) = 1/101. \tag{10}$$

Then

$$\begin{aligned} L(z; 0) &= [99z(0) + z(1)]/100 \\ L(z; 1) &= 1 - z(1) \\ L(z; 2) &= [1 - z(0)]/101. \end{aligned} \tag{11}$$

It is easily verified that the only minimax z* is defined by z*(0) = 0,
z*(1) = 100/101, and that L(z*; i) = L* = 1/101 for every i. But it
is also easily verified that the only unbiased tests are absurd in that
they ignore the observation x; they are in fact just those for which
z(0) = z(1).
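
Both verifications can be carried out mechanically. The sketch below encodes (9) and (10) in exact fractions and searches a finite grid of candidate tests; the grid (multiples of 1/101 for z(0) and z(1)) and all names are choices made here for illustration, so the search confirms the claims only on that grid.

```python
from fractions import Fraction as F

# P[i][x]: conditional probabilities of (9)
P = {0: {0: F(99, 100), 1: F(1, 100)},
     1: {0: F(0), 1: F(1)},
     2: {0: F(1), 1: F(0)}}
H0, H1 = (0,), (1, 2)
# basic losses of (10): the loss of the wrong act at each i
wrong_act_loss = {0: F(1), 1: F(1), 2: F(1, 101)}

def power(z, i):
    """E(z | i), the probability of rejecting the null hypothesis at i."""
    return P[i][0] * z[0] + P[i][1] * z[1]

def loss(z, i):
    """L(z; i) as given by (2), which reduces to (11) here."""
    error_prob = power(z, i) if i in H0 else 1 - power(z, i)
    return wrong_act_loss[i] * error_prob

def max_loss(z):
    return max(loss(z, i) for i in (0, 1, 2))

def is_unbiased(z):
    """Condition (5)."""
    return all(power(z, i0) <= power(z, i1) for i0 in H0 for i1 in H1)

grid = [F(k, 101) for k in range(102)]   # illustrative grid for z(0) and z(1)
tests = [{0: a, 1: b} for a in grid for b in grid]

best = min(tests, key=max_loss)
unbiased = [z for z in tests if is_unbiased(z)]
print(best, max_loss(best))
print(all(z[0] == z[1] for z in unbiased))
```

On this grid the smallest maximum loss is attained at z(0) = 0, z(1) = 100/101, and every unbiased test has z(0) = z(1), so that none of them has maximum loss below 1/2.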

It has until quite recently been said by many that attention should
be confined to tests such that there is a fixed probability α (called the
size of the test) of making an error of the first kind for every i ∈ H₀.
Indeed, the criterion of size has often been taken so seriously as to be
incorporated into the very definition of a test. Though many
important tests happen to have a size, others equally important do not; so
it now seems to be recognized [L4] that the possession of a size cannot
be taken seriously as a criterion.† To take an everyday example,
consider the binomial distributions

$$P(x \mid p) = \binom{101}{x} p^x (1 - p)^{101 - x}, \tag{12}$$

where the parameter p confined to [0, 1] plays the role of i and x = 0,
···, 101; and suppose that H₀ is the hypothesis that p ≤ 1/2. A test
of size α is a test for which

$$\sum_{x=0}^{101} z(x) \binom{101}{x} p^x (1 - p)^{101 - x} = \alpha \tag{13}$$

for all p ≤ 1/2. This obviously implies

$$\sum_{x} [z(x) - \alpha] \binom{101}{x} \left(\frac{p}{1 - p}\right)^x = 0 \tag{14}$$

for all p ≤ 1/2, whence z(x) = α for every x. So only absurd tests
have size, in this example, though there are clearly tests here that are
quite satisfactory for many applications, for example, let z(x) equal 0
or 1 according as x ≤ 50 or x > 50.
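
That the test just mentioned has no size in the strict sense can be seen by computing its probability of rejection at a few values of p in H₀. The sketch below does so in exact arithmetic; the particular values of p and the names used are illustrative choices made here.

```python
from math import comb
from fractions import Fraction as F

N = 101  # number of trials in (12)

def rejection_probability(z, p):
    """E(z | p): the left side of (13) for the test z at parameter p."""
    return sum(z(x) * comb(N, x) * p**x * (1 - p)**(N - x)
               for x in range(N + 1))

def z(x):
    # reject the null hypothesis exactly when x > 50
    return 1 if x > 50 else 0

# illustrative parameter values in the null hypothesis p <= 1/2
probs = {p: rejection_probability(z, p) for p in (F(1, 10), F(3, 10), F(1, 2))}
for p, value in probs.items():
    print(p, float(value))
```

The probability of an error of the first kind rises with p and reaches exactly 1/2 at p = 1/2, so it is far from constant over H₀.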

In view of the criticism just made, there is a tendency to redefine
size so that any test has a size α, namely,

$$\alpha =_{\mathrm{Df}} \max_{i \in H_0} E(z \mid i). \tag{15}$$

In terms of this definition of size, a concept of testing somewhat
different from that proposed in this section has been defined and defended
(Wald, p. 21 of [W3], and Lehmann, pp. 17-18 of [L4]); namely, it is
postulated that a test is to be chosen not from among all possible tests,
but only from among those having a size α (in the sense of (15)) given
as part of the testing problem.‡ This concept of testing is not defended
to the exclusion of the one proposed here, but it is asserted by the
authors cited to be more realistic for some problems. The arguments of
both authors on this point are similar and, I think, quite weak in two
crucial places, for the advantage is supposed to flow in some
unspecified way from the undemonstrated impossibility of comparing
preferences for consequences of qualitatively different kinds. It seems, if I
may be allowed such a conjecture, that the concept of testing under a
constraint of size represents a Procrustean attempt to fit the (older)
Neyman-Pearson theory of testing hypotheses too closely with the
(newer) minimax theory. It is not to be denied, of course, that there
may sometimes be a mathematical advantage in studying and
comparing tests of given size.


† Statisticians interested in the Behrens-Fisher problem may be interested in pp.
35.173a-b of [F6], which hinge on the question of size as a criterion.

‡ The constraint actually imposed, especially by Lehmann [L4], is that the size
be at most α. But, as Lehmann explains, this difference is more apparent than real.

It should be mentioned, before concluding the subject, that any
theory taking size seriously introduces an asymmetry of the theory with
respect to H₀ and H₁, an asymmetry that is surely not always
appropriate.

Significance level, or level of significance, is a synonym (neglecting 
a slight distinction made in [L4]) of size, probably more widely used 
than size itself. 


3 Testing in practice 


The theory of testing admits some fairly realistic applications, but 
the present state of statistics is such that the theory of testing is in- 
voked more often than not in problems on which it does not bear 
squarely. This section discusses typical applications of the theory, 
pointing out the shortcomings I am aware of. 

The development of the theory of testing has been much influenced
by the special problem of simple dichotomy, that is, testing problems
in which H₀ and H₁ have exactly one element each. Simple dichotomy
is susceptible of neat and full analysis (as in Exercise 7.5.2 and in
§ 14.4), likelihood-ratio tests here being the only admissible tests; and
simple dichotomy often gives insight into more complicated problems,
though the point is not explicitly illustrated in this book.

Coin and ball examples of simple dichotomy are easy to construct,
but instances seem rare in real life. The astronomical observations
made to distinguish between the Newtonian and Einsteinian hypotheses
are a good, but not perfect, example, and I suppose that research in
Mendelian genetics sometimes leads to others. There is, however, a
tradition of applying the concept of simple dichotomy to some situations to
which it is, to say the best, only crudely adapted. Consider, for
example, the decision problem of a person who must buy, f₀, or refuse to
buy, f₁, a lot of manufactured articles on the basis of an observation x.
Suppose that i is the difference between the value of the lot to the
person and the price at which the lot is offered for sale, and that P(x | i) is
known to the person. Clearly, H₀, H₁, and N are sets characterized
respectively by i > 0, i < 0, i = 0. This analysis of this, and similar,
problems has recently been explored in terms of the minimax rule, for
example by Sprowls [S16] and a little more fully by Rudy [R4], and by
Allen [A3]. It seems to me natural and promising for many fields of
application, but it is not a traditional analysis. On the contrary, much
literature recommends, in effect, that the person pretend that only two
values of i, i₀ > 0 and i₁ < 0, are possible and that the person then
choose a test for the resulting simple dichotomy. The selection of the
two values i₀ and i₁ is left to the person, though they are sometimes
supposed to correspond to the person's judgment of what constitutes
good quality and poor quality, terms really quite without definition.
The emphasis on simple dichotomy is tempered in some
acceptance-sampling literature, where it is recommended that the person choose
among available tests by some largely unspecified overall consideration
of operating characteristics and costs, and that he facilitate his survey
of the available tests by focusing on a pair of points that happen to
interest him and considering the test whose operating characteristic
passes (economically, in the case of sequential testing) through the
pair of points. These traditional analyses are certainly inferior in the
theoretical framework of the present discussion, and I think they will
be found inferior in practice.

To make a small digression, there is a complication in connection with
testing whether to buy that is not ordinarily envisaged by statistical
theory; namely, the economic reaction between the buyer and the
supplier. If, for example, the supplier knows the test the buyer is going
to apply, that knowledge will influence the quality of the lot supplied.
There seems to be little, if any, successful work on the economic
problem thus raised about the game-like behavior of the two people involved
(cf. pp. 331, 340, and 346 of [W6]).

The problem whether to buy a lot obviously has many formal coun- 
terparts in other domains. In some of them it is particularly clear that 
purely objectivistic methods do not suffice. To illustrate, imagine two 
experiments: one designed to determine whether it is advantageous to 
add a certain small amount of sodium fluoride to the drinking water of 
children, the other to determine whether the same amount of oil of 
peppermint is advantageous. Granting that each of the two additions 
can be made at the same cash cost for labor and material and that the 
designs of the two hypothetical experiments differ only in the inter- 
change of the roles of sodium fluoride and oil of peppermint, the corre- 
sponding testing problems are objectivistically completely parallel, that 
is, the same with regard to loss function and conditional probability of 
the observations. But it must be acknowledged, I think, that the people
actually charged with the decision in either of these two cases would
and should take into account opinions they had before the observation.
For example, they might originally have considered it nearly impossible
that the oil of peppermint could result in any hygienic advantage large
enough to compensate for even the small cost of its administration, but,
in view of recent dental researches on the subject, they might not have 
considered it at all unlikely that the sodium fluoride should have an 
overall advantage. In that case, parallel observations in the two ex- 
periments would not always lead to parallel decisions. Objectivists 
typically admit such a possibility but go on to say that it is unreasonable 
to isolate the experiment and that it is the totality of information bear- 
ing on the subject that should be treated objectivistically. If objectiv- 
ists could give a more detailed discussion of how to deal with such a 
totality of information, it might do much to clarify their position. 

I turn now to a different and, at least for me, delicate topic in connec- 
tion with applications of the theory of testing. Much attention is given 
in the literature of statistics to what purport to be tests of hypotheses, 
in which the null hypothesis is such that it would not really be accepted 
by anyone. The following three propositions, though playful in con- 
tent, are typical in form of these extreme null hypotheses, as I shall call 
them for the moment. 


A The mean noise output of the cereal Krakl is a linear function of 
the atmospheric pressure, in the range from 900 to 1,100 millibars. 


B The basal metabolic consumption of sperm whales is normally 
distributed [W11]. 


C New York taxi drivers of Irish, Jewish, and Scandinavian extrac- 
tion are equally proficient in abusive language. 


Literally to test such hypotheses as these is preposterous. If, for ex- 
ample, the loss associated with $f_1$ is zero, except in case Hypothesis A 
is exactly satisfied, what possible experience with Krakl could dissuade 
you from adopting $f_1$? 

The unacceptability of extreme null hypotheses is perfectly well 
known; it is closely related to the often heard maxim that science dis- 
proves, but never proves, hypotheses. The role of extreme hypotheses 
in science and other statistical activities seems to be important but ob- 
scure. In particular, though I, like everyone who practices statistics, 
have often “tested” extreme hypotheses, I cannot give a very satisfac- 
tory analysis of the process, nor say clearly how it is related to testing 
as defined in this chapter and other theoretical discussions. None the 
less, it seems worth while to explore the subject tentatively; I will do 
so largely in terms of two examples. 

Consider first the problem of a cereal dynamicist who must estimate 
the noise output of Krakl at each of ten atmospheric pressures between 
900 and 1,100 millibars. It may well be that he can properly regard the 
problem as that of estimating the ten parameters in question, in which 
case there is no question of testing. But suppose, for example, that 
one or both of the following considerations apply. First, the engineer 
and his colleagues may attach considerable personal probability to the 
possibility that A is very nearly satisfied—very nearly, that is, in terms 
of the dispersion of his measurements. Second, the administrative, 
computational, and other incidental costs of using ten individual esti- 
mates might be considerably greater than that of using a linear formula. 
It might be impractical to deal with either of these considerations very 
rigorously. One rough attack is for the engineer first to examine the 
observed data x and then to proceed either as though he actually be- 
lieved Hypothesis A or else in some other way. The other way might be 
to make the estimate according to the objectivistic formulae that would 
have been used had there been no complicating considerations, or it 
might take into account different but related complicating considera- 
tions not explicitly mentioned here, such as the advantage of using a 
quadratic approximation. It is artificial and inadequate to regard this 
decision between one class of basic acts or another as a test, but that 
is what in current practice we seem to do. The choice of which test 
to adopt in such a context is at least partly motivated by the vague 
idea that the test should readily accept, that is, result in acting as though 
the extreme null hypotheses were true, in the farfetched case that the 
null hypothesis is indeed true, and that the worse the approximation of 
the null hypotheses to the truth the less probable should be the ac- 
ceptance. 

The method just outlined is crude, to say the best. It is often modi- 
fied in accordance with common sense, especially so far as the second 
consideration is concerned. Thus, if the measurements are sufficiently 
precise, no ordinary test might accept the null hypotheses, for the ex- 
periment will lead to a clear and sure idea of just what the departures 
from the null hypotheses actually are. But, if the engineer considers 
those departures unimportant for the context at hand, he will justifiably 
decide to neglect them. 

Rejection of an extreme null hypothesis, in the sense of the foregoing 
discussion, typically gives rise to a complicated subsidiary decision 
problem. Some aspects of this situation have recently been explored, 
for example by Paulson [P3], [P4]; Duncan [D11], [D12]; Tukey [T4], 
[T5]; Scheffé [S7]; and W. D. Fisher [F7]. 

To summarize abstractly, I would say that, in current practice, so- 
called tests of extreme hypotheses are resorted to when at least a little 
credence is attached to the possibility that the null hypothesis is very 
nearly true and when there is some special advantage to behaving as 
though it were true. One other illustration will make it clear that point 
estimation is not essential to the situation and that belief in the approxi- 
mate truth of the null hypothesis alone does not always justify testing. 

Consider the personnel manager of a great New York taxi company. 
Wishing, of course, that his drivers should be as proficient as possible, 
he would, under simple circumstances, hire exclusively from the na- 
tional-extraction group that had obtained the highest mean scores in a 
standard proficiency examination; for why should he not be guided by 
a positive indication, however slight? A statistical test of the extreme 
Hypothesis C would not, therefore, be called for, as has been pointed 
out in general terms by Bahadur and Robbins [B3]. Even strong be- 
lief that ethnic differences are extremely small in the respect in question 
would not alone be any reason for departing from this simple policy, 
dictated by the principle of admissibility—quite in contrast to the ex- 
ample framed around Hypothesis A. If, however, public opinion, a 
shortage of labor, or administrative difficulty militates against any dis- 
crimination at all, the manager may resort to a test based on the ex- 
amination scores. 

In practice, tests of extreme hypotheses are typically chosen from a 
relatively small arsenal of standard types, or families, each family con- 
sisting of one unmixed test at every significance level (as size is always 
called in this context). In publications, it is standard practice not 
simply to report the result of a test, but rather to report that level of 
significance for which the corresponding test of the relevant family 
would be on the borderline between acceptance and rejection. The 
rationale usually given for this procedure is that it enables each user 
of the publication to make his own test at the significance level he deems 
appropriate to his particular problem. Thus the significance level is 
supposed to play much the same practical role as a sufficient statistic. 
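
The reporting convention just described can be sketched numerically. The following is only an illustration, assuming a family of two-sided tests on a single observation z that is standard normal under the null hypothesis; that family, and the function names, are assumptions made here and are not taken from the text.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def borderline_level(z):
    """Significance level at which the two-sided test of the assumed
    family is on the borderline between acceptance and rejection."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))

def rejects(z, level):
    """Would the test of the given significance level reject?"""
    return borderline_level(z) < level

# Hypothetical observation; the reported level is close to 0.05.
reported = borderline_level(1.96)
```

A reader given only the reported level can then carry out the test at whatever level he deems appropriate by comparing the two numbers, which is the sense in which the level is supposed to serve him like a sufficient statistic.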

An interesting contribution to the theory of extreme hypotheses is 
given by Bahadur [B1] in the special context of the two-sided t-test. 


CHAPTER 17 


Interval Estimation and Related Topics 


1 Estimates of the accuracy of estimates 


The doctrine is often expressed that a point estimate is of little, or 
no, value unless accompanied by an estimate of its own accuracy. This 
doctrine, which for the moment I will call the doctrine of accuracy 
estimation, may be a little old-fashioned, but I think some critical 
discussion of it here is in order for two reasons. In the first place, the doctrine 
is still widely considered to contain more than a grain of truth. For 
example, many readers will think it strange, and even remiss, that I 
have written a long chapter (Chapter 15) on estimation without even 
suggesting that an estimate should be accompanied by an estimate of 
its accuracy. In the second place, it seems to me that the concept of 
interval estimation, which is the subject of the next section, has largely 
evolved from the doctrine of accuracy estimation and that discussion 
of the doctrine will, for some, pave the way for discussion of interval 
estimation. 

The doctrine of accuracy estimation is vague, even by the standards 
of the verbalistic tradition, for it does not say what should be taken 
as a measure of accuracy, that is, what an estimate of accuracy should 
estimate. Any measure would be rather arbitrary; a typical one, here 
adopted for definiteness, is the root-mean-square error, 


(1)    $E^{1/2}((l - \lambda(i))^2 \mid B_i) = \{V(l \mid B_i) + [E(l \mid B_i) - \lambda(i)]^2\}^{1/2}$, 


using (15.6.23). The root-mean-square error reduces to the standard 
deviation, $V^{1/2}(l \mid B_i)$, in case the estimate $l$ is unbiased. 
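
As a minimal check of the decomposition in (1), the sketch below computes the mean-square error, the variance, and the squared bias of an estimate that takes finitely many values. The particular values, probabilities, and target are hypothetical, chosen only so that the arithmetic is exact.

```python
from fractions import Fraction

def mse_decomposition(values, probs, target):
    """Return (mean-square error, variance, squared bias) of an estimate
    with the given possible values and probabilities, for a given target."""
    mean = sum(p * v for v, p in zip(values, probs))
    mse = sum(p * (v - target) ** 2 for v, p in zip(values, probs))
    var = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    bias_sq = (mean - target) ** 2
    return mse, var, bias_sq

# Hypothetical estimate: values 0, 1, 3 with probabilities 1/4, 1/2, 1/4,
# used to estimate a parameter whose true value is 1.
values = [Fraction(0), Fraction(1), Fraction(3)]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
mse, var, bias_sq = mse_decomposition(values, probs, Fraction(1))

# The root-mean-square error is the square root of variance plus squared bias.
rmse = float(var + bias_sq) ** 0.5
```

Here the variance is 19/16 and the squared bias is 1/16, which together give the mean-square error 5/4.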

Taking the doctrine literally, it evidently leads to endless regression, 
for an estimate of the accuracy of an estimate should presumably be 
accompanied by an estimate of its own accuracy, and so on forever. 

Even supposing that the doctrine were somehow purged of vagueness 
and endless regression, it would still be in clear conflict with the be- 
havioralistic concept of estimation studied in Chapter 15. If a decision 
problem consists in deciding on a number in the light of an observation, 
the person concerned wants to adopt an $l$ that is, in some sense or 
other, as good as possible; but, since he must make some decision, it 
could at most satisfy idle curiosity to know how good the best is— 
idle, I say, because, his decision once made, there is no way to use knowl- 
edge of its accuracy. 

Since it seems to me that the kind of problem envisaged in Chapter 
15 is of frequent occurrence and may properly be called estimation, 
I am inclined to say that the doctrine of accuracy estimation is errone- 
ous. However, it is possible that someone should point out a different 
class of problems, also properly called problems of estimation, with re- 
spect to which the doctrine has some validity; though, so far as I know, 
this has not yet occurred. 

One sort of situation that might, through what I would consider 
faulty analysis, seem to support the doctrine of accuracy estimation is 
illustrated by the following, highly schematized example. A person 
has to estimate the number n of replacement parts of a certain sort 
that should be carried by an expedition. He can conduct a trial the 
outcome of which will, let us say, be an observation x distributed in 
the Poisson distribution with mean equal to acn; that is, 


(2)    $P(x \mid n) = e^{-acn}(acn)^x / x!$, 


where a is a known constant and c, which the person can choose, is the 
cost (beyond overhead) of the trial. Under reasonable hypotheses, 
once c has been chosen and the value x observed, n(x) = x/ac is a good 
estimate of n; and in so far as the problem is of the type envisaged in 
Chapter 15, that is the end of the matter. 

But there may be features of the problem that have not yet been 
stated, though in principle they should have been. In particular, it 
may be that the person is free to conduct a second trial, though there 
will typically be a high penalty for doing so. One rough, but sometimes 
natural and practical, step toward deciding whether a second trial is 
called for is to remark that $(n(x)/ac)^{1/2}$ is a good estimate of the root-mean- 
square error of n(x) and may give a fairly good basis on which to judge 
whether the risk of misestimation warrants the expense of a second 
trial. 
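
The two rough quantities in this example can be sketched as follows. The constant a, the cost c, and the observed x are hypothetical numbers, and the truncated sum is merely a convenient way to confirm that the root-mean-square error of x/ac is (n/ac)^{1/2} when x is Poisson with mean acn.

```python
import math

def estimate_n(x, a, c):
    """Point estimate n(x) = x / (a c) of the number of parts."""
    return x / (a * c)

def estimated_rmse(x, a, c):
    """Rough estimate (n(x) / (a c))**0.5 of the root-mean-square error."""
    return math.sqrt(estimate_n(x, a, c) / (a * c))

def exact_rmse(n, a, c, x_max=400):
    """Root-mean-square error of x / (a c) when x is Poisson with mean a c n.
    The sum is truncated at x_max, ample for the small means used here."""
    mean = a * c * n
    pmf = math.exp(-mean)            # P(x = 0 | n)
    total = 0.0
    for x in range(x_max + 1):
        total += pmf * (estimate_n(x, a, c) - n) ** 2
        pmf *= mean / (x + 1)        # P(x + 1 | n) obtained from P(x | n)
    return math.sqrt(total)

a, c = 0.5, 8.0                      # hypothetical constant and trial cost
n_hat = estimate_n(12, a, c)         # estimate after observing x = 12
rough = estimated_rmse(12, a, c)     # basis for judging a second trial
```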

My own conviction is that we should frankly regard such a problem 
as has just been described as a special problem in sequential analysis 
and treat it as an organic whole. Viewed thus, c is to be chosen in the 
light of the possibility of making a second trial. The decision to be 
based on x is the complex one of whether to go to the expense of a second 
trial; if so, of what magnitude; and, if not, what estimate of n to adopt. 

Another sort of situation that seems to have stimulated the doctrine 
of accuracy estimation is the following. Suppose that a research worker 
has observed $x_1, \ldots, x_n$, which are independent and normally distributed 
about the mean $\mu$ with variance $\sigma^2$ given $\mu$ and $\sigma$. If he wishes to pub- 
lish the results of his investigation for all concerned to use as their own 
needs and opinions may dictate, he should, ideally, publish a sufficient 
statistic of his observation, stating how it is distributed given $\mu$ and $\sigma$. 
Any other course may deprive some reader of some information he 
might be able to put to use. So far as the primary aim is concerned, all 
sufficient statistics are equivalent, but secondary considerations greatly 
narrow the research worker’s choice. To illustrate, consider the five 
sufficient statistics the values of which for $\{x_1, \ldots, x_n\}$ are: 

(a) $\{x_1, \ldots, x_n\}$. 
(b) The n order statistics of $\{x_1, \ldots, x_n\}$. 
(c) $\sum x_i$ and $\sum x_i^2$. 
(d) $\bar{x} =_{Df} \sum x_i / n$ and $s^2 =_{Df} (\sum x_i^2 - \bar{x} \sum x_i)/(n - 1)$. 
(e) $\bar{x}$ and $s/n^{1/2}$. 


If n is at all large, (c), (d), and (e) are cheaper to publish than (a) 
and (b). Moreover, for almost any use to which a reader might wish 
to put the data, (c), (d), and (e) will save him a considerable amount 
of computation. In so far as it is true that almost any reader who has 
a use for the data at all will use $\bar{x}$, but not necessarily $\sum x_i$, statistics 
like (d) and (e) are slightly preferable to (c). There is something to be 
said both for (d) and for (e), in view of the ready availability of certain 
tables; but, at least when n is very large, there is a slight advantage to 
(e) for those calculations a reader is most likely to perform. In par- 
ticular, a reader using (e) can, when n is large, often ignore the actual 
value of n. Even if the distributions of the $x_1, \ldots, x_n$ are not exactly 
normal, (c), (d), and (e) often can play almost the same role as suffi- 
cient statistics. It is no wonder then that (e) is often chosen as a con- 
venient way to present data. But, in my opinion, it is a mistake to 
lay great theoretical emphasis on the fact that (e) happens to consist 
of what is ordinarily a good estimate of $\mu$, namely $\bar{x}$, together with what 
is ordinarily a good estimate of the root-mean-square error of that es- 
timate, namely $s/n^{1/2}$. 
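
For concreteness, the statistics (c), (d), and (e) can be computed from a set of observations as in the sketch below. The observations are hypothetical, and the function name is introduced here only for illustration.

```python
import math

def summarize(xs):
    """Return the statistics (c), (d), and (e) for the observed values xs."""
    n = len(xs)
    total = sum(xs)                            # (c): sum of the x_i
    total_sq = sum(x * x for x in xs)          # (c): sum of the squares
    mean = total / n                           # (d): x-bar
    s2 = (total_sq - mean * total) / (n - 1)   # (d): s squared
    std_err = math.sqrt(s2 / n)                # (e): s / n**0.5
    return {"c": (total, total_sq), "d": (mean, s2), "e": (mean, std_err)}

stats = summarize([1.0, 2.0, 4.0, 5.0])        # hypothetical observations
```

Given n, each of the three pairs can be recovered from either of the others, which is why they are equivalent so far as the primary aim is concerned.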


2 Interval estimation and confidence intervals 


The verbalistic tradition has suggested a procedure different from 
point estimation but somehow related to it. This other procedure, here 
called interval estimation, can be defined as follows, though the defini- 
tion is necessarily vague. Where x is an observation subject to the 
conditional distributions $P(x \mid B_i)$ and $\lambda(i)$ is a function of $i$, guess 
that $\lambda(i)$ lies in some set M(x) (to be called an interval estimate) de- 
termined for each value of x. It is almost a part of the definition to 
say that the function M(x) is to be so chosen that $P(\lambda(i) \in M(x) \mid B_i)$ 
shall be nearly 1 for every $i$ and that M(x) should tend to be small and 
“close knit” in a geometrical sense, some compromise being effected be- 
tween these two conflicting desiderata. The parameter $\lambda(i)$ could in 
principle be a very general function, but it will here be enough to sup- 
pose for definiteness and simplicity that $\lambda(i)$ is real. Though more 
general possibilities are contemplated in principle, the set M(x) is in 
practice typically a bounded interval, which corresponds with what I 
meant in saying that M(x) is supposed to be “close knit.” 

The idea of interval estimation is complicated; an example is in order. 
Suppose that, for each $\lambda$, x is a real random variable normally distrib- 
uted about $\lambda$ with unit variance; then, as is very easy to see with the 
aid of a table of the normal distribution, if M(x) is taken to be the in- 
terval [x − 1.9600, x + 1.9600], then 


(1)    $P(\lambda \in M(x) \mid \lambda) = \alpha$, 


where $\alpha$ is constant and almost equal to 0.95. 

It is usually thought necessary to warn the novice that such an equa- 
tion as (1) does not concern the probability that a random variable $\lambda$ 
lies in a fixed set M(x). Of course, $\lambda$ is given and therefore not random 
in the context at hand; and, given $\lambda$, $\alpha$ is the probability that M(x), 
which is a contraction of x, has as its value an interval that contains $\lambda$. 
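
The value of α in (1), and the point that it is the interval and not λ that is random, can be illustrated by a short computation. This is only a sketch: the seed, the number of trials, and the particular values of λ tried are arbitrary choices made here.

```python
import math
import random

def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def coverage(half_width):
    """P(lambda in [x - h, x + h] | lambda) when x is normal about lambda
    with unit variance; the answer does not depend on lambda."""
    return normal_cdf(half_width) - normal_cdf(-half_width)

def simulated_coverage(lam, half_width, trials, seed=0):
    """Fraction of simulated intervals M(x) that contain the fixed lambda."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.gauss(lam, 1.0)                 # the observation is random
        if x - half_width <= lam <= x + half_width:
            hits += 1                           # this interval covers lambda
    return hits / trials

alpha = coverage(1.96)                          # almost equal to 0.95
frequencies = [simulated_coverage(lam, 1.96, 20000) for lam in (-3.0, 0.0, 10.0)]
```

In each simulated run λ is held fixed while x, and with it M(x), varies; the observed frequencies of coverage are all close to α.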

Why seek an interval estimate? One sort of verbalistic answer runs 
like this: At first glance, the problem of estimation seems to require 
that a person guess, on observing that $\mathbf{x}$ takes the value $x$, that $\lambda(i)$ 
has some particular value $l(x)$; but, since it is virtually impossible that 
such a guess should be correct, it seems better to try something else. 
In particular, it is often possible to assert that $\lambda(i)$ is in a comparatively 
narrow interval M(x), chosen according to such a system that it is very 
improbable for each $i$ that the assertion will be false. Less extreme ver- 
balistic explanations tend to give the impression that point estimation 
need not be altogether rejected, but that interval estimation satisfies 
a parallel need. 

The first part of the explanation just cited is specious, since no one 
really expects a point estimate to be correct, and since, when one really 
is obliged by circumstances to make a point estimate in the behavioral- 
istic sense, there is no escaping it. None the less, that part of the ex- 
planation does seem to give some insight into the appeal of interval es- 
timation. The second part of the explanation is a sort of fiction; for it 
will be found that whenever its advocates talk of making assertions that 
have high probability, whether in connection with testing or estima- 
tion, they do not actually make such assertions themselves, but end- 
lessly pass the buck, saying in effect, “This assertion has arisen accord- 
ing to a system that will seldom lead you to make false assertions, if 
you adopt it. As for myself, I assert nothing but the properties of the 
system.” 

From the behavioralistic point of view, I maintain that point estima- 
tion fulfils an important function. On the other hand, I can cite no 
important behavioralistic interpretation of interval estimation. More- 
over, in such direct and indirect contact as I have had with actual sta- 
tistical practice, I have—with but one extraordinary exception, which 
will soon be discussed—encountered no applications of interval estima- 
tion that seemed convincing to me as anything more than an informal 
device for exploring data or crudely summarizing it for others. In 
short, not being convinced myself, I am in no position to present con- 
vincing evidence for the usefulness of interval estimation as a direct 
step in decision. The reader should know, however, that few are as 
pessimistic as I am about interval estimation and that most leaders in 
statistical theory have a long-standing enthusiasm for the idea, which 
may have more solid grounds than I now know. 

The following is a schematized example of one sort of decision prob- 
lem that does call for something like interval estimation. An observa- 
tion x bears on the position of a lifeboat, the occupants of which will 
be saved or lost according as the boat is or is not sighted by a search- 
ing aircraft before nightfall. The decision problem is, therefore, to 
choose, from all the domains that the airplane could search in time, one 
domain M(x); and the loss must, in effect, be reckoned as 0 or 1 accord- 
ing as M(x) does or does not contain $\lambda$. This type of problem seems, 
however, too rare and too special to be taken as representative of those 
for which interval estimation is so widely advocated. 

Many criteria have been put forward for interval estimation, but I 
am of course in no position to discuss them critically. J. Neyman has 
gone about the search for criteria systematically, setting up a parallel- 
ism between the theory of interval estimation and of testing. In par- 
ticular, paralleling the criterion of fixed size for tests, he has emphasized 
interval estimates such that 


(2)    $P(\lambda(i) \in M(x) \mid B_i) = \alpha$ 


for a fixed $\alpha$ (typically close to 1) and for every $i$. Such interval esti- 
mates are called confidence intervals at the confidence level $\alpha$. The 
interval estimate mentioned in connection with (1) is obviously a con- 
fidence interval. Wald [W3] sought to include the theory of confidence 
intervals in the minimax theory, but in my opinion he did not succeed 
in giving interval estimation a behavioralistic interpretation. 

Though I am in no position to criticize any criterion of interval es- 
timation, I venture to ask whether (2) is not gratuitous, as I have more 
positively asserted of its analogue in the theory of testing. 

Chapters 19 and 20 of [K2] will serve as key references for interval 
estimation. 


3 Tolerance intervals 

There has recently been considerable study of what are called toler- 
ance intervals (or limits). They are related to the problem of guessing 
the actual value of a real random variable y, on the basis of an obser- 
vation of x. A tolerance interval for y at tolerance level $\alpha$ and confi- 
dence level $\beta$ is an interval-valued function Y(x) such that 


(1)    $P[P(y \in Y(x) \mid B_i, x) \ge \alpha \mid B_i] = \beta$ 


for every $i$. 

The concept expressed by (1) is a slippery one; perhaps it will help 
to express it in words thus: For every $B_i$, there is probability $\beta$ that x is 
such that y will fall in Y(x) with probability at least $\alpha$, given $B_i$ and 
x. In typical applications y is independent of x; this permits a slight 
simplification of the definition. The notion of tolerance interval seems 
to me at least as unamenable to behavioralistic interpretation as that 
of confidence interval, and I therefore venture no discussion of it here. 
Key references are [B22] and [W7]. 


4 Fiducial probability 

This is not really a section on fiducial probability, but rather an 
apology for not having such a section. The concept of fiducial proba- 
bility put forward and stressed by R. A. Fisher is the most disputed 
technical concept of modern statistics, and, since the concept is largely 
concerned with interval estimation, I wanted to discuss it here. I 
have, however, been privileged to see certain as yet unpublished manu- 
scripts of R. M. Williams [W12] and J. W. Tukey which convince me 
that such discussion by me now would be premature. 

Some key references to fiducial probability and to the Behrens-Fisher 
problem, which is the most disputed field of application of fiducial 
probability, are Fisher’s own papers, especially [F5], and Papers 22, 
25, 26, 27, and 35 of the collection [F6]; Kendall [K2], Chapter 20; 
Yates [Y1]; Owen [O1]; Segal [S9]; Bartlett [B6]; Scheffé [S6], [S95]; 
Walsh [W9]; and Chand [C5].+ 

+ And I can now add Barnard (1963), Dempster (1964), Fisher (1956, Sec- 
tions III 3, IV 6, V 5, V 8, VI 8, VI 12), Linnik (1968, Chapters VIII-X), 
Patil (1965), Scheffé (1970), Tukey (1957), and Williams (1966). 


APPENDIX 1 


Expected Value 


This appendix, a brief account of some relatively elementary aspects 
of the badly named mathematical concept, expected value, is presented 
for those who might otherwise be handicapped in reading this book. 
No proofs are given here, but the reader who needs this appendix will 
probably be willing and able to accept the facts cited without proof, 
especially if he acquires intuition for the subject by working the sug- 
gested exercises. The requisite proofs are, however, given implicitly 
in any standard work on integration or measure (e.g., Chapters I-V of 
[H2]). 

Throughout this appendix, let S be a set with elements s and subsets 
A, B, C, --- on which a (finitely additive) probability measure P is 
defined. Bounded real random variables, that is, bounded real-valued 
functions, defined for each $s \in S$, will here be denoted by $\mathbf{x}, \mathbf{y}, \ldots$, and 
real numbers by $x, y, z$, and lower-case Greek letters. 

The expected value of x, generally written E(x), is characterized as 
the one and only function attaching a real number to every bounded 
random variable x, subject to the following three conditions for every 
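$\mathbf{x}$, $\mathbf{y}$, $\rho$, $\sigma$, and B: 


(1)    $E(\rho\mathbf{x} + \sigma\mathbf{y}) = \rho E(\mathbf{x}) + \sigma E(\mathbf{y})$. 
(2)    $E(\mathbf{x}) \ge 0$ whenever $P(\mathbf{x}(s) < 0) = 0$. 
(3)    $E(c(\ \mid B)) = P(B)$. 


In (3), $c(\ \mid B)$ is the characteristic function of B, that is, $c(s \mid B) = 1$, 
if $s \in B$, and $c(s \mid B) = 0$, if $s \in {\sim}B$. In mathematical contexts remote 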
from the topics in this book, the term “characteristic function” has at 
least two other meanings virtually unconnected with the one at hand, 
one in connection with linear operators on function spaces, and another 
in connection with the Fourier analysis of distributions. 

Often the expected value of x is referred to as the integral of x over 
S, in which case it is generally written $\int \mathbf{x}(s)\,dP(s)$. 


Exercises 


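1. If $\mathbf{x}$ takes only a finite number of values, $x_1, \ldots, x_m$, except on a 
set of probability zero; then 


(4)    $E(\mathbf{x}) = \sum_i x_i P(\mathbf{x}(s) = x_i)$, 


that is, the average of the $x_i$’s, each weighted by the probability of its 
occurrence. 

2. If $P(\mathbf{x}(s) < \mathbf{y}(s)) = 0$, $E(\mathbf{x}) \ge E(\mathbf{y})$; and if, in addition, 
$P(\mathbf{x}(s) > \mathbf{y}(s) + \epsilon) > 0$ for some $\epsilon > 0$, then $E(\mathbf{x}) > E(\mathbf{y})$.† 

3. If $\mathbf{x}$ is a real random variable, $B_i$ a partition, $\rho_i$ and $\sigma_i$ real numbers 
such that $\rho_i \le \mathbf{x}(s) \le \sigma_i$ for $s \in B_i$, then 

(5)    $\sum \rho_i P(B_i) \le E(\mathbf{x}) \le \sum \sigma_i P(B_i)$. 

4. $c(\ \mid A \cap B) = c(\ \mid A)\,c(\ \mid B)$, 

$c(\ \mid {\sim}A) = 1 - c(\ \mid A)$, 

$c(\ \mid A \cup B) = c(\ \mid A) + c(\ \mid B) - c(\ \mid A)\,c(\ \mid B)$. 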
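
Exercises 1 and 4, together with condition (3), can be checked mechanically on a small example. The sketch below assumes a hypothetical finite S, the six faces of a fair die; the random variable and the sets A and B are likewise chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical finite S: the six faces of a fair die.
P = {s: Fraction(1, 6) for s in range(1, 7)}

def prob(event):
    """P of the set of s for which event(s) is true."""
    return sum(p for s, p in P.items() if event(s))

def expect(x):
    """Expected value of the random variable x, summed point by point."""
    return sum(p * x(s) for s, p in P.items())

def indicator(event):
    """Characteristic function c( | B) of the set B described by event."""
    return lambda s: 1 if event(s) else 0

x = lambda s: s % 3                      # a random variable with values 0, 1, 2

# Exercise 1: weight each value by the probability of its occurrence.
by_values = sum(v * prob(lambda s: x(s) == v) for v in {x(s) for s in P})

A = lambda s: s % 2 == 0                 # even faces
B = lambda s: s > 3                      # faces 4, 5, 6
cA, cB = indicator(A), indicator(B)

def identities_hold():
    """Check the three identities of Exercise 4 at every point of S."""
    for s in P:
        if indicator(lambda u: A(u) and B(u))(s) != cA(s) * cB(s):
            return False
        if indicator(lambda u: not A(u))(s) != 1 - cA(s):
            return False
        if indicator(lambda u: A(u) or B(u))(s) != cA(s) + cB(s) - cA(s) * cB(s):
            return False
    return True
```

Here `expect(x)` and `by_values` agree, and `expect(indicator(B))` equals `prob(B)`, as condition (3) requires.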


As is explained in texts on measure theory, the expected value can 
(at least for countably additive measures), and in practice must, be ex- 
tended to many unbounded random variables. 

Since, provided P(B) > 0, the conditional probability, defined by 
$P(C \mid B) = P(C \cap B)/P(B)$, is itself a probability measure, the ex- 
pectation of $\mathbf{x}$ with respect to a conditional probability is a meaningful 
concept. This conditional expectation is written $E(\mathbf{x} \mid B)$ and read 
“the expected value of $\mathbf{x}$ given B.” 


More exercises 


5. E(x| B) = E(xc(| B))/P(B). Hint: It suffices to verify that the 
expression on the right satisfies the three conditions parallel to (1-3) 
that define E(x | B). 

6. If $B_i$ is a partition of S, then 


(6)    $\sum_i c(s \mid B_i) = 1$ for every s. 

7. $E(\mathbf{x}) = \sum_i E(\mathbf{x} \mid B_i) P(B_i)$. Hint: Use $\mathbf{x} = 1\mathbf{x}$. 


† Technical note: In the event that P is countably additive, $P(\mathbf{x}(s) > \mathbf{y}(s)) > 0$ 
implies the existence of a suitable $\epsilon$, so then $\epsilon$ need not be mentioned at all. 

Suppose $\mathbf{y}$ is a (not necessarily real) random variable that takes on 
only a finite number of values. It will be understood that $E(\mathbf{x} \mid y)$ is 
the expected value of $\mathbf{x}$ given that $\mathbf{y}(s) = y$, provided $y$ is such that 
this event has positive probability. Furthermore, it will be understood 
that $E(\mathbf{x} \mid \mathbf{y})$ is a bounded real random variable that for each s takes 
the value $E(\mathbf{x} \mid \mathbf{y}(s))$. The definition leaves $E(\mathbf{x} \mid \mathbf{y})$ undefined on the 
null set of those points s where $\mathbf{y}(s)$ is a value that $\mathbf{y}$ takes on with prob- 
ability zero. It is immaterial how this blemish is removed; in particu- 
lar $E(\mathbf{x} \mid \mathbf{y})$ may as well be set equal to 0, where it has not already been 


Still more exercises 


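8. $E(E(\mathbf{x} \mid \mathbf{y})) = E(\mathbf{x})$. 
9. If f is a real-valued function defined on the values of $\mathbf{y}$; then $f(\mathbf{y})$ 
is a bounded real variable, and 


(7)    $E(f(\mathbf{y})\mathbf{x}) = E(f(\mathbf{y})E(\mathbf{x} \mid \mathbf{y}))$. 

10. If $h(y)$ is such that, for all f, 

(8)    $E(f(\mathbf{y})\mathbf{x}) = E(f(\mathbf{y})h(\mathbf{y}))$, 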


then h(y(s)) = E(x | y(s)), except possibly on a set of s’s of probability 
zero. 


Exercise 9 and its corollary, 8, present the most frequently used prop- 
erties of conditional expectation. Exercise 10 shows that the property 
presented in 9 characterizes conditional expectation. Through this 
characterization Kolmogoroff [K7] extends the ideas of conditional ex- 
pectation and also of conditional probability (for countably additive 
measures) to random variables y not necessarily confined to a finite or 
even denumerable set of values; though the definition in terms of ordi- 
nary conditional probability then breaks down completely, the proba- 
bility that y(s) = y often being 0 for every y. 
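
For a random variable y confined to finitely many values, the properties in Exercises 8 and 9 can be verified directly. The following is a sketch on a hypothetical finite S (a fair die); x, y, and f are arbitrary choices, and every value of y has positive probability, so the blemish mentioned above does not arise.

```python
from fractions import Fraction

P = {s: Fraction(1, 6) for s in range(1, 7)}   # hypothetical fair die

def expect(z):
    """Expected value of the random variable z."""
    return sum(p * z(s) for s, p in P.items())

def cond_expect(x, y):
    """Return E(x | y) as a random variable, that is, as a function of s."""
    table = {}
    for value in {y(s) for s in P}:
        mass = sum(p for s, p in P.items() if y(s) == value)
        weighted = sum(p * x(s) for s, p in P.items() if y(s) == value)
        table[value] = weighted / mass         # E(x | y = value)
    return lambda s: table[y(s)]

x = lambda s: s * s
y = lambda s: s % 2                            # 1 on odd faces, 0 on even
f = lambda v: 2 * v + 1                        # any function of the values of y
e_x_given_y = cond_expect(x, y)

lhs = expect(lambda s: f(y(s)) * x(s))             # E(f(y) x)
rhs = expect(lambda s: f(y(s)) * e_x_given_y(s))   # E(f(y) E(x | y))
```

Taking f identically 1 in the same comparison gives Exercise 8 as the corollary it is said to be.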


APPENDIX 2 


Convex Functions 


This appendix gives a brief account of convex functions in the same 
spirit as the preceding one gives an account of expected value. Reason- 
able facsimiles of the proofs omitted here are scattered through [H4], 
where they may be found by anyone not content to skip them. 

An interval is a set I of real numbers such that, if $x, z \in I$ and $x < y < z$, 
then $y \in I$. It is not difficult to see that intervals can be classified 
according to Table 1, where it is to be understood that $x < z$. 


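TABLE 1. THE VARIOUS TYPES OF INTERVALS 


Symbolic                The set of real y's 
designation             such that              Verbal description 

$(-\infty, +\infty)$    $y = y$                The infinite interval (the set of all real numbers) 
$(x, +\infty)$          $x < y$                Open half-infinite interval 
$(-\infty, x)$          $x > y$                Open half-infinite interval 
$[x, +\infty)$          $x \le y$              Closed half-infinite interval 
$(-\infty, x]$          $x \ge y$              Closed half-infinite interval 
$(x, z)$                $x < y < z$            Open bounded interval 
$(x, z]$                $x < y \le z$          Half-open bounded interval 
$[x, z)$                $x \le y < z$          Half-open bounded interval 
$[x, z]$                $x \le y \le z$        Closed bounded interval 
$[x, x]$                $x = y$                One-point interval 
                        $y < y$                The vacuous interval (the vacuous set) 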


A real-valued function t defined for x in an interval I is convex, if 
and only if the graph of the function never rises above any chord of it- 
self. Analytically, if $\rho$ and $\sigma$ are positive, $\rho + \sigma = 1$, and $x, y \in I$; then 


(1)    $t(\rho x + \sigma y) \le \rho t(x) + \sigma t(y)$. 

If equality holds in (1) for some $\rho$; then, as is easily verified, it holds 
for every $\rho$, and t is linear, i.e., of the form $\alpha x + \beta$, in the closed 
interval $[x, y]$. An interval in which t is linear will here be called an 
interval of linearity. If and only if there are no intervals of linearity 
other than the one-point and vacuous intervals, t is strictly convex. 


Exercises 


1. Verify, at least graphically, that the following functions are con- 
vex in the indicated intervals; discuss their intervals of linearity; and 
say which are strictly convex. 


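$I = (-\infty, +\infty)$: 

(a) $e^{\rho x}$ for every $\rho$, (b) $x^2 + \rho x + \sigma$ for every $\rho$ and $\sigma$, 
(c) $|x|$, (d) $|x|^{\rho}$ for $\rho > 1$, 

(e) $2x$. 

$I = (0, \infty)$: 

(f) $-\log x$, (g) $x^{\rho}$ for $-\infty < \rho < 0$. 

$I = (-1, +1)$: 

(h) $(1 - x^2)^{-1/2}$, (i) $1 - \cos(x/2)$. 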


2. In an interval where t is convex, if $d^2t(x)/dx^2$ exists at x, then 
$d^2t(x)/dx^2 \ge 0$; and if, for every x in an interval I, $d^2t(x)/dx^2$ exists and 
is non-negative, then t is convex in I. 

3. Re-explore Exercise 1 in the light of 2. 

4. Let T be a non-vacuous set of functions, t, t′, ..., convex in I, 
and let 


(2)    $t^*(s) = \sup_{t \in T} t(s)$. 


In (2), as always in mathematics, the sup, or supremum, of a set of 
numbers is the least number, possibly $\infty$, that is not less than any ele- 
ment of the set. If $t^*(s) < \infty$ for every $s \in I$, then $t^*$ is convex in I. 
Explore the proposition just stated, first graphically, especially for a 
finite set of linear t’s, and then analytically. What if the elements of 
T are all strictly convex? 

5. In an open interval where t is convex, it is also continuous. What 
are the facts for closed and half-closed intervals? 

6. If t is convex in I, $x_k \in I$, $\rho_k > 0$, and $\sum \rho_k = 1$, where $k = 1, \ldots, r$; 
then 


(3)    $\sum_k \rho_k t(x_k) \ge t(\sum_k \rho_k x_k)$. 


Equality obtains, if and only if all the $x_k$’s are in a single interval of 
linearity of t. 


(a) Interpret the propositions above in terms of probability. 
(b) Prove them by arithmetic induction on r. 
(c) What if t is strictly convex? 


Exercise 6 suggests, and indeed proves a special case of, the following 
well-known and most useful theorem, which cannot be proved here in 
full generality. 


THEOREM 1 If t is convex and bounded in the interval I, and $\mathbf{x}(s) \in I$ 
for all $s \in S$, then 


(4)    $E(t(\mathbf{x})) \ge t(E(\mathbf{x}))$. 


Equality obtains, if and only if the values of x are with probability one 
contained in a single interval of linearity of t. Here and throughout this 
appendix, such conditions for equality are to be understood to apply 
only in the event that either P is countably additive or the random 
variable is with probability one confined to a finite set of values; the 
general situation for finitely additive measures is a little more compli- 
cated. 
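
Theorem 1 and its condition for equality can be illustrated with a random variable confined to a finite set of values. The sketch below uses two hypothetical sets of values with the same probabilities and the convex functions x² and |x|; the second set lies wholly in [0, 2], an interval of linearity of |x|.

```python
from fractions import Fraction

def expect(values, probs, t=lambda v: v):
    """Expected value of t(x) for x taking the given values and probabilities."""
    return sum(p * t(v) for v, p in zip(values, probs))

probs = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
spread = [Fraction(-1), Fraction(0), Fraction(2)]    # straddles the kink of |x|
one_side = [Fraction(0), Fraction(1), Fraction(2)]   # |x| is linear on [0, 2]

square = lambda v: v * v

# E(t(x)) - t(E(x)) is non-negative for convex t.
gap_square = expect(spread, probs, square) - square(expect(spread, probs))
gap_abs = expect(spread, probs, abs) - abs(expect(spread, probs))

# Equality when the values lie in a single interval of linearity.
gap_abs_linear = expect(one_side, probs, abs) - abs(expect(one_side, probs))
```

The first two gaps are strictly positive (27/16 and 1/2), while the third is zero.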


More exercises 
7. The variance of $x$, often written $V(x)$, is defined thus: 


(5)    $V(x) = E((x - E(x))^2).$

Show that 

(6)    $V(x) = E(x^2) - E^2(x) \ge 0,$


with equality if and only if $P(x(s) = E(x)) = 1$. 
8. Show that, if x is never smaller than some positive number, 


(7)    $\log E^{-1}(x^{-1}) \le E(\log x) \le \log E(x).$


When can either equality obtain? Write the analogue of (7) suggested 
by (3), and show thereby that (7) is a generalization of the familiar 
fact that the arithmetic mean (of positive numbers) is at least as great 
as the geometric mean and the geometric mean is at least as great as 
the harmonic mean. 
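
As a numerical companion to Exercises 7 and 8, the following sketch evaluates both forms of the variance and the three terms of (7) for one small distribution. The distribution `dist` and the helper `expect` are assumptions made for illustration; the values are chosen so that the arithmetic is exact in binary floating point.

```python
import math

def expect(g, dist):
    """E(g(x)) for a finite distribution given as {value: probability}."""
    return sum(p * g(x) for x, p in dist.items())

# Hypothetical distribution of a positive random variable x.
dist = {1.0: 0.25, 2.0: 0.5, 4.0: 0.25}

mean = expect(lambda x: x, dist)

# Exercise 7: definition (5) and the right side of (6) should agree.
var_def = expect(lambda x: (x - mean) ** 2, dist)
var_alt = expect(lambda x: x * x, dist) - mean ** 2

# Exercise 8: the three terms of (7), i.e. the logarithms of the
# harmonic, geometric, and arithmetic means of x.
log_harmonic = -math.log(expect(lambda x: 1.0 / x, dist))  # log E^{-1}(x^{-1})
mean_log = expect(math.log, dist)                          # E(log x)
log_mean = math.log(mean)                                  # log E(x)

print(var_def, var_alt)
print(log_harmonic, mean_log, log_mean)
```

Exponentiating the three terms of (7) gives the harmonic, geometric, and arithmetic means in that order, which is the sense in which (7) generalizes the familiar chain of inequalities among them.
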


One of the most famous of all inequalities is the Schwarz inequality, 
which can, though not quite obviously, be derived from Theorem 1, 
and which can be stated in terms of expected values thus: 


(8)    $E^2(xy) \le E(x^2)E(y^2),$


with equality obtaining if and only if for some numbers $\rho$ and $\sigma$ not 
both zero 


(9)    $P(\rho x(s) = \sigma y(s)) = 1.$


Note that (9) expresses (perhaps too compactly) that, except on some 
set of probability zero, either x or y vanishes identically or else each is 
a fixed multiple of the other. 

Statistically speaking, the Schwarz inequality expresses, in effect, 
the familiar fact that any correlation coefficient must lie between +1 
and —1, one of the extremes occurring if and only if at least one of the 
two random variables involved is a linear function of the other. 
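
The statistical reading of (8) can be illustrated with a short computation. The sketch below assumes a hypothetical four-point distribution and uses helper names (`expect`, `correlation`) that are not from the text. It checks (8) directly and computes a correlation coefficient, both for an arbitrary pair of variables and for a pair in which one is a linear function of the other.

```python
import math

def expect(values, probs):
    """Expected value of a random variable listed state by state."""
    return sum(p * v for v, p in zip(values, probs))

def correlation(xs, ys, probs):
    """Correlation coefficient: covariance over the product of standard deviations."""
    mx, my = expect(xs, probs), expect(ys, probs)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    cov = expect([a * b for a, b in zip(dx, dy)], probs)
    vx = expect([a * a for a in dx], probs)
    vy = expect([b * b for b in dy], probs)
    return cov / math.sqrt(vx * vy)

# Hypothetical data: four equally probable states.
probs = [0.25, 0.25, 0.25, 0.25]
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]

lhs = expect([x * y for x, y in zip(xs, ys)], probs) ** 2   # E^2(xy)
rhs = (expect([x * x for x in xs], probs)
       * expect([y * y for y in ys], probs))                # E(x^2)E(y^2)

r = correlation(xs, ys, probs)
r_linear = correlation(xs, [3.0 - 2.0 * x for x in xs], probs)  # y linear in x

print(lhs, rhs, r, r_linear)
```

Applying (8) to the centered variables $x - E(x)$ and $y - E(y)$ is what bounds the correlation coefficient by 1 in absolute value; the linear case sits at an extreme.
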

The concept of convex functions and its implications can easily be 
extended to real-valued functions defined on vectors in an n-dimensional 
vector space, the role of intervals there being replaced by convex sub- 
sets of the vector space; but an understanding of this extension, though 
desirable, is not absolutely essential in reading this book. 

One good introduction to convex subsets of vector spaces is Sections 
16.1-2 of [V4], and another especially adapted to statistical applications 
is incorporated in [B18]. The standard treatise on the topic is 
that of Bonnesen and Fenchel [B20]. 

Bibliographic Material 


The bibliography of about 170 items that terminates this appendix 
lists not only all works referred to in this book but also some others, 
for it is intended to serve not only as a mechanical aid to reference but 
also as a briefly and informally annotated list of suggested readings in 
the foundations of statistics. In addition to the notes incorporated 
into the bibliography, information about many of the works listed there 
is given in other parts of the book, where it can be found by referring 
to the author’s name in the author index. References that have come to 
my attention since the first edition are in Appendix 4: Bibliographic 
Supplement. They are cited by the convention according to which the 
first of them is called (Aczél 1966). 

Todhunter has abundant references scattered in chronological order 
through [T3], emphasizing the mathematical aspects of probability up 
through the period of Laplace. Keynes, in [K4], gives a formal bibli- 
ography which purposely does not overlap Todhunter’s material very 
extensively, the emphasis being on more philosophical aspects of prob- 
ability and on the period between Laplace and Keynes. Carnap in 
[C1] also gives a formal bibliography, which emphasizes publications 
since Keynes. Carnap promises an even fuller bibliography in the 
projected second volume of his work, and he recommends the bibliog- 
raphy of Georg Henrik von Wright in [V5]. 

Bibliographies of statistics proper are of some, though diluted, rele- 
vance. Of these, the most useful is that of M. G. Kendall in Vol. II 
of [K2]. Carnap at the beginning of his bibliography gives reference to 
some other statistical bibliographies. The enormous work of O. K. Buros 
in statistical bibliography, [B23], [B24], and [B25], should also be 
mentioned. His volumes bring together pointed excerpts from reviews 
of statistical books. Buros also directed a bibliographic department, 
entitled “Statistical Methodology,” in the Journal of the American Statistical 
Association from September 1945 to September 1948, listing current 
articles, books, theses, and chapters dealing with statistics. In 
Volume 20 (1949) of the Annals of Mathematical Statistics, an important 
journal of statistical theory, there are two cumulative indexes of Volumes 
1-20, one arranged by author, the other by subject. 

Postulates of a Personalistic Theory of Decision 


The seven postulates (P1 through P7) scattered through the first 
five chapters of this book are reproduced here for ready reference along 
with a minimum of explanatory material. The language of the postu- 
lates is here changed somewhat for conciseness and to show an alterna- 
tive mode of expression, but the logical content of each postulate is 
left unaltered. 


The formal subject matter of the theory 
The states, a set $S$ of elements $s, s', \ldots$ with subsets $A, B, C, \ldots$ (page 11). 

The consequences, a set $F$ of elements $f, g, h, \ldots$ (page 14). 

Acts, arbitrary functions $f, g, h, \ldots$ from $S$ to $F$ (page 14). 

The relation “is not preferred to” between acts, $\le$ (page 18). 


The postulates, and definitions on which they depend 


Definitions of terms not in general mathematical use are given here 
as D1 through D5; for others consult the General Index (page 289) 
and the Technical Symbols (page 283). 


P1 The relation $\le$ is a simple ordering (page 18). 

D1 $f \le g$ given $B$, if and only if $f' \le g'$ for every $f'$ and $g'$ that 
agree with $f$ and $g$, respectively, on $B$ and with each other on $\sim B$ 
and $g' \le f'$ either for all such pairs or for none (page 22). 

P2 For every $f$, $g$, and $B$, $f \le g$ given $B$ or $g \le f$ given $B$ (page 23). 

D2 $g \le g'$; if and only if $f \le f'$, when $f(s) = g$, $f'(s) = g'$ for every 
$s \in S$ (page 25). 

D3 $B$ is null, if and only if $f \le g$ given $B$ for every $f$, $g$ (page 24). 

P3 If $f(s) = g$, $f'(s) = g'$ for every $s \in B$, and $B$ is not null; then 
$f \le f'$ given $B$, if and only if $g \le g'$ (page 26). 

D4 $A \le B$; if and only if $f_A \le f_B$ or $g \le g'$ for every $f_A$, $f_B$, $g$, $g'$ 
such that: $f_A(s) = g$ for $s \in A$, $f_A(s) = g'$ for $s \in \sim A$, $f_B(s) = g$ for 
$s \in B$, $f_B(s) = g'$ for $s \in \sim B$ (page 31). 

P4 For every $A$, $B$, $A \le B$ or $B \le A$ (page 31). 

P5 It is false that, for every $f$, $f'$, $f \le f'$ (page 31). 

P6 Suppose it false that $g \le h$; then, for every $f$, there is a (finite) 
partition of $S$ such that, if $g'$ agrees with $g$ and $h'$ agrees with $h$ except 
on an arbitrary element of the partition, $g'$ and $h'$ being equal to $f$ 
there, then it will be false that $g' \le h$ or $g \le h'$ (page 39). 

D5 $f \le g$ given $B$ ($g \le f$ given $B$); if and only if $f \le h$ given $B$ 
($h \le f$ given $B$), when $h(s) = g$ for every $s$ (page 72). 

P7 If $f \le g(s)$ given $B$ ($g(s) \le f$ given $B$) for every $s \in B$, then 
$f \le g$ given $B$ ($g \le f$ given $B$) (page 77). 
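
Definitions D1 and D3 and postulate P2 can be made concrete in a toy finite model. Everything in the sketch below is an assumption made for illustration: the three states, the numeric consequences, the probabilities in `PROB`, and above all the choice to model "is not preferred to" by comparing expected values. In the theory itself probability and utility are derived from the postulates rather than assumed, and a finite toy of this kind is not meant to satisfy all seven postulates (P6 in particular calls for arbitrarily fine partitions of S).

```python
from itertools import combinations, product

STATES = ("s1", "s2", "s3")
CONSEQUENCES = (0, 1, 2)                   # hypothetical, scored by their own value
PROB = {"s1": 0.5, "s2": 0.5, "s3": 0.0}   # assumed for illustration only

def value(act):
    """Expected score of an act, i.e. of a function from STATES to CONSEQUENCES."""
    return sum(PROB[s] * act[s] for s in STATES)

def not_preferred(f, g):
    """Stand-in for the relation f <= g between acts (an assumption)."""
    return value(f) <= value(g)

def not_preferred_given(f, g, B):
    """D1: f <= g given B, checked over every common completion off B."""
    outside = [s for s in STATES if s not in B]
    le, ge = [], []
    for fill in product(CONSEQUENCES, repeat=len(outside)):
        common = dict(zip(outside, fill))
        f1 = {**{s: f[s] for s in B}, **common}   # agrees with f on B
        g1 = {**{s: g[s] for s in B}, **common}   # agrees with g on B, with f1 off B
        le.append(not_preferred(f1, g1))
        ge.append(not_preferred(g1, f1))
    return all(le) and (all(ge) or not any(ge))

ALL_ACTS = [dict(zip(STATES, cs))
            for cs in product(CONSEQUENCES, repeat=len(STATES))]
ALL_EVENTS = [set(c) for r in range(len(STATES) + 1)
              for c in combinations(STATES, r)]

def is_null(B):
    """D3: B is null iff f <= g given B for every pair of acts."""
    return all(not_preferred_given(f, g, B) for f in ALL_ACTS for g in ALL_ACTS)

def satisfies_p2():
    """P2: for every f, g, B, either f <= g given B or g <= f given B."""
    return all(not_preferred_given(f, g, B) or not_preferred_given(g, f, B)
               for f in ALL_ACTS for g in ALL_ACTS for B in ALL_EVENTS)

null_events = [sorted(B) for B in ALL_EVENTS if is_null(B)]
print(null_events, satisfies_p2())
```

In this model the null events are the vacuous event and the event consisting of the one state given probability zero, and P2 holds because an expected-value comparison ignores whatever two acts have in common outside B.
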


